Adjusting speed of human speech playback

ABSTRACT

A system configured to vary a speech speed of speech represented in input audio data without changing a pitch of the speech. The system may vary the speech speed based on a number of different inputs, including non-audio data, data associated with a command, or data associated with the voice message itself. The non-audio data may correspond to information about an account, device or user, such as user preferences, calendar entries, location information, etc. The system may analyze audio data associated with the command to determine command speech speed, identity of person listening, etc. The system may analyze the input audio data to determine a message speech speed, background noise level, identity of the person speaking, etc. Using all of these inputs, the system may dynamically determine a target speech speed and may generate output audio data having the target speech speed.

CROSS-REFERENCE TO RELATED APPLICATION

This application is a continuation of, and claims the benefit of, U.S. Non-provisional patent application Ser. No. 15/677,659, filed Aug. 15, 2017, and entitled “ADJUSTING SPEED OF HUMAN SPEECH PLAYBACK”, and scheduled to issue on Apr. 30, 2019 as U.S. Pat. No. 10,276,185, which is expressly incorporated herein by reference in its entirety.

BACKGROUND

With the advancement of technology, the use and popularity of electronic devices has increased considerably. Electronic devices are commonly used to capture and process audio data.

BRIEF DESCRIPTION OF DRAWINGS

For a more complete understanding of the present disclosure, reference is now made to the following description taken in conjunction with the accompanying drawings.

FIG. 1 illustrates a system according to embodiments of the present disclosure.

FIG. 2 is a diagram of components of a system according to embodiments of the present disclosure.

FIGS. 3A-3D illustrate a conceptual diagram of how adjusting a speed of human speech playback is performed along with examples of input data, command speech data and message data used to adjust the speech speed according to examples of the present disclosure.

FIGS. 4A-4B are flowcharts conceptually illustrating example methods for adjusting a speed of human speech playback according to examples of the present disclosure.

FIG. 5 illustrates an example of applying different speech speed modification variables to different portions of input audio data according to examples of the present disclosure.

FIG. 6 illustrates an example of incrementally changing speech speed modification variables to avoid distortion according to examples of the present disclosure.

FIG. 7 illustrates examples of modifying a speech speed and inserting additional pauses in output audio data according to examples of the present disclosure.

FIGS. 8A-8B illustrate examples of modifying a volume of input audio data in conjunction with modifying a speech speed according to examples of the present disclosure.

FIG. 9 illustrates an example of identifying speech from multiple users and applying different speech speed modification variables based on the user according to examples of the present disclosure.

FIGS. 10A-10B are block diagrams conceptually illustrating example components of a system for voice enhancement according to embodiments of the present disclosure.

DETAILED DESCRIPTION

Electronic devices may be used to capture and process audio data that includes speech. The audio data may be sent as a voice message or as part of a communication session, such as a voice over internet protocol (VoIP) telephone call, a videoconference or the like. The speech may be difficult to understand for a number of reasons, such as being too fast or too slow, the talker having an accent, variations in loudness of different words or sentences, presence of background noise, or the like. Sometimes, only a portion of the speech is difficult to understand, such as important information corresponding to a name, an address, a phone number or the like. The audio data may be processed to improve playback, which includes speeding up or slowing down the speech represented in the audio data without shifting a pitch of the speech. Thus, during playback of the voice message or the communication session (e.g., on the receiving side), a modified speech speed may be faster or slower than an original speech speed. However, choosing an undesired speech speed may negatively impact playback of the audio data, and the desired speech speed may vary throughout the voice message and/or communication session.

To improve playback of the audio data, devices, systems and methods are disclosed that perform normalization of human speech playback and dynamically adjust a target speech speed. For example, the system may dynamically adjust the target speech speed based on a number of inputs associated with input audio data, including non-audio data (e.g., input data), data associated with a command (e.g., command speech data), or data associated with the voice message itself (e.g., message speech data). The input data may correspond to information about an account, device or user, such as user preferences, calendar entries, location information, or the like. The system may analyze audio data associated with the command to determine the command speech data (e.g., command speech speed, identity of person listening, etc.) and/or may analyze the input audio data to determine the message speech data (e.g., message speech speed, background noise level, identity of the person speaking, etc.). Using all of these inputs, the system may dynamically determine a target speech speed and may generate output audio data having the target speech speed. In some examples, the system may adjust portions of the input audio data to have different target speech speeds, add additional pauses, adjust a volume of the speech, separate speech associated with different people, determine different target speech speeds for the different people, or the like. The system may adjust a speed of human speech playback on voice messages and/or communication sessions (e.g., VoIP telephone calls, video conversations or the like) without departing from the disclosure.

FIG. 1 illustrates a high-level conceptual block diagram of a system 100 configured to adjust a speed of human speech playback. Although FIG. 1, and other figures/discussion illustrate the operation of the system in a particular order, the steps described may be performed in a different order (as well as certain steps removed or added) without departing from the intent of the disclosure. As illustrated in FIG. 1, the system 100 may include a Voice over Internet Protocol (VoIP) device 30, a public switched telephone network (PSTN) telephone 20 connected to an adapter 22, a first device 110 a, a second device 110 b and/or a server(s) 120, which may all be communicatively coupled to network(s) 10.

The VoIP device 30, the PSTN telephone 20, the first device 110 a and/or the second device 110 b may communicate with the server(s) 120 via the network(s) 10. For example, one or more of the VoIP device 30, the PSTN telephone 20, the first device 110 a and the second device 110 b may send audio data to the server(s) 120 via the network(s) 10, such as a voice message or audio during a communication session. While not illustrated in FIG. 1, the audio data may be associated with video data (e.g., video message, video communication session, etc.) without departing from the disclosure.

The VoIP device 30 may be an electronic device configured to connect to the network(s) 10 and to send and receive data via the network(s) 10, such as a smart phone, tablet or the like. Thus, the VoIP device 30 may send audio data to and/or receive audio data from the server(s) 120, either during a VoIP communication session or as a voice message. In contrast, the PSTN telephone 20 may be a landline telephone (e.g., wired telephone, wireless telephone or the like) connected to the PSTN (not illustrated), which is a landline telephone network that may be used to communicate over telephone wires, and the PSTN telephone 20 may not be configured to directly connect to the network(s) 10. Instead, the PSTN telephone 20 may be connected to the adapter 22, which may be configured to connect to the PSTN and to transmit and/or receive audio data using the PSTN and configured to connect to the network(s) 10 (using an Ethernet or wireless network adapter) and to transmit and/or receive data using the network(s) 10. Thus, the PSTN telephone 20 may use the adapter 22 to send audio data to and/or receive audio data from the second device 110 b during either a VoIP communication session or as a voice message.

The first device 110 a and the second device 110 b may be electronic devices configured to send audio data to and/or receive audio data from the server(s) 120. The device(s) 110 may include microphone(s) 112, speakers 114, and/or a display 116. For example, FIG. 1 illustrates the second device 110 b including the microphone(s) 112 and the speakers 114, while the first device 110 a includes the microphone(s) 112, the speakers 114 and the display 116. While the second device 110 b is illustrated as a speech-controlled device without the display 116, the disclosure is not limited thereto and the second device 110 b may include the display 116 without departing from the disclosure. Using the microphone(s) 112, the device(s) 110 may capture audio data and send the audio data to the server(s) 120.

While the server(s) 120 may receive audio data from multiple devices, for ease of explanation the disclosure illustrates the server(s) 120 receiving audio data from a single device at a time. For example, a first user may be associated with one of the VoIP device 30, the PSTN telephone 20, or the first device 110 a and may send audio data to a second user associated with the second device 110 b. In some examples, the audio data is associated with a one-way exchange (e.g., voice message), such that the first device 110 a sends the audio data to the server(s) 120 at a first time and the second device 110 b receives the audio data from the server(s) 120 at a second time, with a gap between the first time and the second time not corresponding to processing and/or networking delays. In other examples, the audio data may be associated with a two-way exchange, such as a real-time communication session (e.g., VoIP telephone conversation, video conversation or the like) in which the first device 110 a sends the audio data to the server(s) 120 and the second device 110 b receives the audio data from the server(s) 120 at roughly the same time, after slight processing and/or networking delays.

The server(s) 120 may be configured to receive input audio data and adjust a speed of human speech playback of the input audio data, as will be discussed in greater detail below, prior to sending output audio data to the second device 110 b for playback. For example, the server(s) 120 may process the input audio data to improve playback, which includes speeding up or slowing down speech represented in the input audio data without shifting a pitch of the speech. Thus, a modified speech speed represented in the output audio data may be faster or slower than an original speech speed represented in the input audio data.

As used herein, “speech speed” (e.g., rate of speed associated with speech included in audio data) refers to a speech tempo, which is a measure of the number of speech units of a given type produced within a given amount of time. Speech speed may be measured using words per minute (wpm) or syllables per second (syl/sec), although the disclosure is not limited thereto. For example, the server(s) 120 may identify portions of the audio data that correspond to individual words and may determine the rate of speed of speech by determining a number of words spoken per minute. Additionally or alternatively, the server(s) 120 may identify portions of the audio data that correspond to individual syllables and may determine the rate of speed of speech by determining a number of syllables spoken per second. An original speech speed refers to the rate of speed at which the original speaker was talking, whereas a target speech speed refers to the rate of speed that is output by the server(s) 120 after adjustment. For example, the server(s) 120 may determine an original speech speed (e.g., 100 words per minute, or wpm) associated with first audio data, determine a target speech speed (e.g., 150 wpm) and may modify the first audio data to generate second audio data having the target speech speed.
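
The speech speed calculation described above can be illustrated with a short sketch. The following Python example is a minimal, hypothetical illustration only; the disclosure does not specify how word or syllable boundaries are obtained (in practice they might come from ASR output or voice activity detection), and the counts used below are invented for the example.

    # Sketch: computing a speech tempo in words per minute (wpm) and
    # syllables per second (syl/sec). The counts and duration are hypothetical.

    def words_per_minute(word_count: int, duration_seconds: float) -> float:
        """Return speech speed in words per minute."""
        return word_count / (duration_seconds / 60.0)

    def syllables_per_second(syllable_count: int, duration_seconds: float) -> float:
        """Return speech speed in syllables per second."""
        return syllable_count / duration_seconds

    # Example: 50 words containing 75 syllables spoken over 30 seconds.
    print(words_per_minute(50, 30.0))      # 100.0 wpm
    print(syllables_per_second(75, 30.0))  # 2.5 syl/sec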

The server(s) 120 may determine a target speech speed based on a number of inputs associated with the input audio data. For example, the server(s) 120 may dynamically adjust the target speech speed based on non-audio data (e.g., input data), data associated with a command (e.g., command speech data), or data associated with the voice message itself (e.g., message speech data), which will be discussed in greater detail below with regard to FIGS. 3A-3D. The input data may correspond to information about an account, device and/or user, such as user preferences (e.g., playback speed preferences) based on a user profile associated with the user, calendar entries (e.g., calendar data), location information (e.g., location data), or the like. The server(s) 120 may analyze audio data associated with the command to determine the command speech data (e.g., command speech speed, identity of person listening, etc.) and/or may analyze the input audio data to determine the message speech data (e.g., message speech speed, background noise level, identity of the person speaking, etc.).

Using all of these inputs, the server(s) 120 may dynamically determine a target speech speed and may generate output audio data having the target speech speed. In some examples, the server(s) 120 may adjust portions of the input audio data to have different target speech speeds, add additional pauses, adjust a volume of the speech, separate speech associated with different people, determine different target speech speeds for the different people, or the like, without departing from the disclosure.

To determine the target speech speed, a speech speed modification component in the server(s) 120 and/or the device 110 may implement one or more machine learning models. For example, the input data, the command speech data, and/or the message speech data may be input to the speech speed modification component, which outputs the target speech speed. A ground truth may be established for purposes of training the one or more machine learning models. In machine learning, the term “ground truth” refers to the accuracy of a training set's classification for supervised learning techniques.

Various machine learning techniques may be used to train and operate the speech speed modification component. Such techniques may include backpropagation, statistical learning, supervised learning, semi-supervised learning, stochastic learning, or other known techniques. Such techniques may more specifically include, for example, neural networks (such as deep neural networks and/or recurrent neural networks), inference engines, trained classifiers, etc. Examples of trained classifiers include Support Vector Machines (SVMs), neural networks, decision trees, AdaBoost (short for “Adaptive Boosting”) combined with decision trees, and random forests. Focusing on SVM as an example, SVM is a supervised learning model with associated learning algorithms that analyze data and recognize patterns in the data, and which are commonly used for classification and regression analysis. Given a set of training examples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new examples into one category or the other, making it a non-probabilistic binary linear classifier. More complex SVM models may be built with the training set identifying more than two categories, with the SVM determining which category is most similar to input data. An SVM model may be mapped so that the examples of the separate categories are divided by clear gaps. New examples are then mapped into that same space and predicted to belong to a category based on which side of the gaps they fall on. Classifiers may issue a “score” indicating which category the data most closely matches. The score may provide an indication of how closely the data matches the category. The user response to content output by the system may be used to further train the machine learning model(s).
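
As one possible (and purely illustrative) realization of the trained classifier described above, the sketch below uses scikit-learn's SVC to map a small, hypothetical feature vector to a coarse speed-adjustment category. The feature set, labels, and library choice are assumptions made for illustration and are not specified by the disclosure.

    # Sketch: an SVM classifier mapping hypothetical input features to a
    # target speech speed category. Feature choices and labels are assumptions.
    from sklearn.svm import SVC

    # Hypothetical training examples:
    # [original_wpm, background_noise_level, estimated_urgency]
    X_train = [
        [180.0, 0.1, 0.2],
        [90.0, 0.6, 0.9],
        [160.0, 0.3, 0.8],
        [100.0, 0.2, 0.1],
    ]
    # Ground-truth labels for supervised learning.
    y_train = ["slow_down", "speed_up", "speed_up", "keep"]

    classifier = SVC(kernel="linear")
    classifier.fit(X_train, y_train)

    # Classify a new utterance described by the same hypothetical features.
    print(classifier.predict([[170.0, 0.5, 0.7]]))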

As illustrated in FIG. 1, the server(s) 120 may receive (130) a command to play a voice message and may receive (132) input audio data corresponding to the voice message. For example, the server(s) 120 may receive a command from the second device 110 b instructing the server(s) 120 to send the voice message to the second device 110 b for playback. Additionally or alternatively, the server(s) 120 may receive command audio data from the second device 110 b and may perform automatic speech recognition (ASR), natural language understanding (NLU) or the like to determine that the command audio data corresponds to a command instructing the server(s) 120 to send the voice message to the second device 110 b for playback.

The server(s) 120 may receive (134) input data associated with the command. The input data may be non-audio data that corresponds to information about an account, the second device 110 b and/or user, such as user preferences, calendar entries, location information or the like. The input data is not determined based on the command audio data or the message audio data, but is received from the second device 110 b and/or from a database associated with the account, the second device 110 b and/or the user. While examples of the input data are illustrated in FIG. 3B, the disclosure is not limited thereto and the input data may include any information that the server(s) 120 may use to determine a target speech speed and/or adjust a speed of human speech playback.

The server(s) 120 may generate (136) message speech data based on the input audio data. The message speech data is associated with speech represented in the input audio data. For example, the server(s) 120 may analyze the input audio data to determine information such as a message speech speed, background noise level, identity of the person speaking, and/or the like. Thus, whereas the input data is information associated with the account, the second device 110 b and/or the user (e.g., based on a user profile associated with the user), the message speech data refers to information derived from the input audio data itself, as will be discussed in greater detail below with regard to FIG. 3D.

The server(s) 120 may determine (138) an original speech speed (e.g., message speech speed). In some examples, the server(s) 120 may determine different original speech speeds associated with different portions of the input audio data, such as when the user speeds up or slows down while leaving the voice message. In some examples, the server(s) 120 determine the original speech speed as part of generating the message speech data and step 138 refers to identifying the original speech speed associated with a portion of the input audio data that is currently being processed by the server(s) 120. However, the disclosure is not limited thereto and the server(s) 120 may determine the original speech speed separately from determining the message speech data without departing from the disclosure.

The server(s) 120 may determine (140) a target speech speed based on the input data and the message speech data, may determine (142) a speech speed modification factor (e.g., speech speed modification variable) based on the original speech speed and the target speech speed and may generate (144) output audio data using the speech speed modification factor. In some examples, the server(s) 120 may determine the target speech speed in words per minute (wpm) and may divide the target speech speed by the original speech speed to determine the speech speed modification factor. For example, if the target speech speed is 120 wpm and the original speech speed is 100 wpm, the speech speed modification factor is equal to 1.2× (e.g., second speech speed associated with the output audio data is 1.2 times faster than a first speech speed associated with the input audio data). Similarly, if the target speech speed is 120 wpm and the original speech speed is 150 wpm, the speech speed modification factor is equal to 0.8× (e.g., second speech speed associated with the output audio data is 0.8 times as fast as the first speech speed associated with the input audio data).

In the examples illustrated above, the server(s) 120 determines a speech speed modification factor (e.g., multiplier), such as 1.2× or 0.8×. The server(s) 120 may determine the speech speed modification factor by dividing the target speech speed by the original speech speed. For ease of explanation, the disclosure will refer to determining and applying a speech speed modification factor to adjust a speech speed. However, the disclosure is not limited to determining a multiplier and the server(s) 120 may instead determine a speech speed modification variable without departing from the disclosure. Thus, any reference to a speech speed modification factor may instead refer to a speech speed modification variable without departing from the disclosure. A speech speed modification variable indicates a relationship between the target speech speed and the original speech speed, enabling the server(s) 120 to achieve the target speech speed by applying the speech speed modification variable to the input audio data. Thus, the server(s) 120 may modify the input audio data using the speech speed modification variable to generate the output audio data.
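
The arithmetic for the speech speed modification factor described above is straightforward; the short sketch below restates it in Python for clarity, using the same example values as the passage.

    # Sketch: deriving a speech speed modification factor (multiplier) from an
    # original speech speed and a target speech speed.

    def speech_speed_modification_factor(original_wpm: float, target_wpm: float) -> float:
        """Return the multiplier that maps the original speed to the target speed."""
        return target_wpm / original_wpm

    print(speech_speed_modification_factor(100.0, 120.0))  # 1.2 (speed up)
    print(speech_speed_modification_factor(150.0, 120.0))  # 0.8 (slow down)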

The server(s) 120 may take into consideration a number of different criteria in determining the target speech speed. In some examples, the server(s) 120 may associate a target speech speed with an identity. For example, the server(s) 120 may detect the identity in the message speech data (e.g., identity of the person speaking in the voice message) and may determine the target speech speed based on previous settings or user preferences associated with the identity (e.g., person speaks fast and the target speech speed should be slower). Additionally or alternatively, the server(s) 120 may determine that the identity is associated with the command (e.g., identity of the listener) and may determine the target speech speed based on previous settings or user preferences associated with the listener (e.g., listener prefers speeding up voice messages).

In some examples, the server(s) 120 may determine the target speech speed based on an estimated urgency. For example, the server(s) 120 may determine a range of values for the target speech speed and use the estimated urgency to select from within the range of values. The server(s) 120 may determine the estimated urgency based on the input data (e.g., calendar entries, location information of the listener, number of voice messages, etc.) and/or the command speech data (e.g., speech speed of the request for playback of voice messages, content analysis of the request, etc.). For example, if the listener requests voice messages en route to a location (e.g., almost to work or home) and/or prior to an upcoming calendar event, the server(s) 120 may determine that the estimated urgency is high and may increase the target speech speed. Additionally or alternatively, if the speech speed of the request for playback of voice messages is fast and/or the number of voice messages is high, the server(s) 120 may determine that the estimated urgency is high and may increase the target speech speed. Similarly, if the server(s) 120 detects an incoming communication (e.g., telephone call, communication session, etc.) or the presence of a guest (e.g., identifying an additional person speaking in the command audio data, detecting an additional face using facial recognition, etc.), the server(s) 120 may determine that the estimated urgency is high and may increase the target speech speed.

The server(s) 120 may determine the target speech speed based on presence information associated with the listener and/or guests. For example, as mentioned above, the server(s) 120 may determine that a new guest has arrived using facial recognition or the like and may increase the target speech speed accordingly. Additionally or alternatively, the server(s) 120 may determine that the listener walked away from the second device 110 b during playback of the voice message or during a communication session (e.g., presence is no longer detected) and may pause or decrease the target speech speed until the listener returns to the second device 110 b (e.g., presence is detected again). Once the listener returns to the second device 110 b, the server(s) 120 may increase the target speech speed until the listener is caught up, at which point the server(s) 120 may return to a normal target speech speed. Thus, the listener may walk away from a communication session and/or voice message and come back without missing anything, with playback at an accelerated rate for a period of time after the listener returns.
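
A minimal sketch of the presence-based behavior described above is shown below. The pause and catch-up factors, the notion of a buffered "backlog," and the function itself are hypothetical simplifications; the disclosure does not prescribe this structure.

    # Sketch: presence-based playback control. Playback is paused while the
    # listener is away, accelerated until a buffered backlog is consumed after
    # the listener returns, and then returned to the normal target speed.

    def playback_factor(listener_present: bool, backlog_seconds: float,
                        normal_factor: float = 1.0,
                        catch_up_factor: float = 1.5) -> float:
        if not listener_present:
            return 0.0              # pause playback while the listener is away
        if backlog_seconds > 0.0:
            return catch_up_factor  # accelerated playback until caught up
        return normal_factor        # normal target speech speed

    print(playback_factor(listener_present=False, backlog_seconds=0.0))  # 0.0
    print(playback_factor(listener_present=True, backlog_seconds=12.0))  # 1.5
    print(playback_factor(listener_present=True, backlog_seconds=0.0))   # 1.0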

In some examples, the server(s) 120 may determine the target speech speed based on an explicit command from the listener. For example, the listener may indicate in the command (e.g., command audio data) a desired speech speed or may input a follow-up command to increase or decrease the target speech speed. Additionally or alternatively, the server(s) 120 may infer that the target speech speed should be increased or decreased based on other commands. For example, if the listener requests the same voice message to be repeated, the server(s) 120 may decrease the target speech speed for subsequent playback.

In some examples, the server(s) 120 may determine the target speech speed based on cues included in the input data and/or detected in the command audio data. For example, the input data may include an indication or notification from a companion device (e.g., smartphone, computer, etc.) that the listener is typing and the server(s) 120 may decrease the target speech speed accordingly. Additionally or alternatively, the server(s) 120 may detect typing in the command audio data (e.g., detect sound associated with a keyboard) and may decrease the target speech speed accordingly. However, the disclosure is not limited thereto and the server(s) 120 may detect cues based on image data (e.g., video during a video communication session, image data captured by the second device 110 b, etc.), such as detecting that the listener is typing or writing something down or just that the listener is located in proximity to a keyboard or the like.

In some examples, the server(s) 120 may determine the target speech speed based on information received from other devices. As discussed above, the input data may include a notification from a companion device that the listener is typing. In addition, the companion device may send a notification if the listener is watching a video, scrolling through a website, email or document, or the like, which would indicate that the listener is distracted and multitasking. Additionally or alternatively, the input data may include a notification and/or media information from other devices associated with the listener, such as devices associated with multimedia playback. To illustrate an example, the listener may be watching video (e.g., a movie or television show) and the server(s) 120 may determine the target speech speed based on a location in the video and/or upcoming content in the video. For example, if the server(s) 120 determine that the video is in a commercial break, the server(s) 120 may increase the target speech speed to playback the voice message prior to the end of the commercial break. Similarly, if the server(s) 120 determine that a current location in the video corresponds to a quiet scene and a loud scene or action or something interesting is coming up, the server(s) 120 may increase the target speech speed. Thus, the server(s) 120 may receive information from additional devices that may influence the target speech speed.

In some examples, the server(s) 120 may determine the target speech speed based on analyzing the input audio data (e.g., message speech data). For example, the server(s) 120 may detect a background noise level and/or signal-to-noise ratio (SNR) and may determine the target speech speed accordingly. Thus, portions of the input audio data corresponding to a low background noise level and/or high SNR may have an increased target speech speed relative to portions of the input audio data corresponding to a high background noise level and/or low SNR. Similarly, the server(s) 120 may perform automatic speech recognition (ASR) processing on the input audio data and may adjust the target speech speed based on an error rate and/or confidence score associated with the ASR. For example, when the error rate increases and/or confidence scores decrease, the server(s) 120 may decrease the target speech speed.
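
One way to picture the quality-driven adjustment described above is as a simple scaling of a base target speed by SNR and ASR confidence. The thresholds, clamping range, and linear scaling in the sketch below are hypothetical choices, not part of the disclosure.

    # Sketch: lowering the target speech speed when the signal-to-noise ratio
    # or the ASR confidence is low. The scaling rules are hypothetical.

    def target_speed_from_quality(base_wpm: float, snr_db: float,
                                  asr_confidence: float) -> float:
        snr_scale = min(max(snr_db / 30.0, 0.7), 1.0)           # clamp to [0.7, 1.0]
        confidence_scale = min(max(asr_confidence, 0.7), 1.0)   # clamp to [0.7, 1.0]
        return base_wpm * snr_scale * confidence_scale

    print(target_speed_from_quality(150.0, snr_db=30.0, asr_confidence=0.95))  # ~142.5
    print(target_speed_from_quality(150.0, snr_db=10.0, asr_confidence=0.60))  # ~73.5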

In some examples, the server(s) 120 may analyze the input audio data and detect types of speech, such as a sequence of numbers (e.g., phone number), a date, an address or the like. For example, the server(s) 120 may decrease a target speech speed for a portion of the input audio data corresponding to a phone number in order to provide the listener additional time to write down the phone number. Additionally or alternatively, the server(s) 120 may detect a foreign accent or foreign language and may decrease the target speech speed.

In some examples, the server(s) 120 may determine different target speech speeds and/or speech speed modification factors for different portions of the input audio data. For example, the server(s) 120 may determine a first target speech speed (e.g., 120 wpm) and/or a first speech speed modification factor (e.g., 0.8×) for a first portion of the input audio data and may determine a second target speech speed (e.g., 100 wpm) and/or a second speech speed modification factor (e.g., 0.66×) for a second portion of the input audio data. In this example, the second portion may include important information (e.g., phone numbers or the like) and the server(s) 120 may decrease the second target speech speed to allow the listener to better understand and/or write down the important information. While this example illustrates dividing the input audio data into two portions, the disclosure is not limited thereto and the number of portions may vary without departing from the disclosure.
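
The per-portion behavior described above can be sketched as a list of segments, each carrying its own speech speed modification factor. The segment boundaries, labels, and factor values below are hypothetical, and the actual time-scale modification of the audio samples (without pitch shift) is outside the scope of the sketch.

    # Sketch: applying different speech speed modification factors to different
    # portions of input audio, e.g. slowing a span that contains a phone number.

    segments = [
        # (start_sec, end_sec, label, factor)
        (0.0,  4.0, "greeting",     1.2),   # speed up ordinary speech
        (4.0,  9.0, "phone number", 0.66),  # slow down important information
        (9.0, 12.0, "sign-off",     1.2),
    ]

    for start, end, label, factor in segments:
        original = end - start
        modified = original / factor  # new duration after the speed change
        print(f"{label}: {original:.1f}s -> {modified:.1f}s at {factor}x")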

In some examples, the server(s) 120 may detect multiple people speaking in the input audio data and may separate speech from different individuals. For example, a first person and a second person may be speaking during the input audio data and the server(s) 120 may identify first speech associated with the first person and second speech associated with the second person. Thus, the server(s) 120 may separate the first speech and the second speech and may separately process the first speech and the second speech (e.g., determine a first target speech speed for the first speech and a second target speech speed for the second speech). Additionally or alternatively, the server(s) 120 may divide the first speech and/or the second speech into portions and may determine different target speech speeds for each of the portions, as discussed above. While this example illustrates identifying speech associated with two different people, the disclosure is not limited thereto and the server(s) 120 may identify three or more distinct voices/people without departing from the disclosure. The server(s) 120 may identify and separate the speech based on voice recognition, beamforming (e.g., separating the input audio data into multiple separate beams, with each beam corresponding to a different location relative to the microphone(s) 112), input data indicating an identity of each user associated with the input audio data, facial recognition (e.g., for input audio data corresponding to video data) and/or the like.
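
A simplified sketch of the per-speaker handling described above is shown next. The diarization-style segment list and the per-speaker factors are hypothetical; the disclosure mentions voice recognition, beamforming, input data identifying the users, and facial recognition as possible ways to separate the speech.

    # Sketch: grouping separated speech segments by speaker identity and
    # assigning each speaker their own speech speed modification factor.
    from collections import defaultdict

    separated_segments = [
        {"speaker": "person_1", "start": 0.0, "end": 5.0},
        {"speaker": "person_2", "start": 5.0, "end": 9.0},
        {"speaker": "person_1", "start": 9.0, "end": 14.0},
    ]
    per_speaker_factor = {"person_1": 0.8, "person_2": 1.1}

    segments_by_speaker = defaultdict(list)
    for segment in separated_segments:
        segments_by_speaker[segment["speaker"]].append(segment)

    for speaker, speaker_segments in segments_by_speaker.items():
        factor = per_speaker_factor[speaker]
        total = sum(s["end"] - s["start"] for s in speaker_segments)
        print(f"{speaker}: {len(speaker_segments)} segments, {total:.1f}s, factor {factor}x")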

In some examples, the server(s) 120 may determine the target speech speed in order to synchronize speech with other audio data. For example, the server(s) 120 may synchronize the speech to match a tempo of a song (e.g., similar to auto-tuning pitch, this would auto-synchronize tempo or cadence). Additionally or alternatively, the server(s) 120 may synchronize a timing of the first speech and the second speech. For example, the first speech and the second speech may correspond to a shared message and/or song (e.g., singing “Happy Birthday”) and the server(s) 120 may determine the target speech speed for different portions of the first speech and the second speech in order to align individual words and/or pacing between the first speech and the second speech. The number of individuals whose speech is synchronized may vary without departing from the disclosure.

As used herein, voice normalization refers to performing signal processing to modify speech. For example, portions or an entirety of speech can be modified to change a speech speed, a volume level or the like to improve playback of the speech. In some examples, different portions of the speech may be modified to have different speech speeds, volume levels or the like without departing from the disclosure. While voice normalization may imply modifying the speech to a certain “normal” range (e.g., universal speech speed or the like, such as speeding up slow speech or slowing down fast speech), the disclosure is not limited thereto. Instead, for ease of explanation, voice normalization may refer to any signal processing used to change a speech speed. Thus, slow speech may be slowed down further, or fast speech sped up, without departing from the disclosure. To illustrate an example, audio data may correspond to speech having a slow speech speed, and a portion of the speech may include a phone number. While the speech is already associated with the slow speech speed, the system 100 may further slow the portion of the speech to provide additional time for a user to write down the phone number.

For ease of illustration, FIG. 1 and other drawings illustrate the server(s) 120 performing voice normalization and/or adjusting a speed of human speech playback. However, the disclosure is not limited thereto and a local device (e.g., second device 110 b) may perform the voice normalization and/or adjusting a speech speed without departing from the disclosure. Additionally or alternatively, the server(s) 120 and the second device 110 b may perform different and/or overlapping steps associated with the voice normalization and/or adjusting a speech speed. For example, the server(s) 120 may preprocess the input audio data to determine estimated target speech speeds corresponding to the input audio data and the second device 110 b may adjust the estimated target speech speeds during playback of the output audio data. Thus, the second device 110 b may determine target speech speeds and/or may send commands to the server(s) 120 instructing the server(s) 120 to adjust the target speech speeds.

As discussed above, the server(s) 120 may perform voice normalization and/or adjust a speed of human speech playback associated with a one-way exchange (e.g., voice message) or a two-way exchange (e.g., real-time communication session, such as a VoIP telephone conversation or video conversation) without departing from the disclosure. Thus, any reference to performing voice normalization and/or adjusting a speech speed on a voice message and/or corresponding steps may also apply to a real-time communication session without departing from the disclosure. For example, the server(s) 120 may periodically determine an original speech speed associated with a person speaking during the communication session and determine a speech speed modification factor that modifies the original speech speed to a target speech speed. Thus, during the communication session, the server(s) 120 may apply a current speech speed modification factor to audio data associated with the person speaking. Thus, the speech speed modification factor may vary over time, depending on the original speech speed of the person speaking. In some examples, the server(s) 120 may determine individual speech speed modification factors for each unique voice (e.g., unique identity) detected in the input audio data. For example, first speech associated with a first person may be adjusted using a first speech speed modification factor while second speech associated with a second person may be adjusted using a second speech speed modification factor. However, the disclosure is not limited thereto and the server(s) 120 may apply a single speech speed modification factor for multiple users without departing from the disclosure.
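
The periodic re-estimation described above for real-time communication sessions can be sketched as recomputing the factor once per analysis window. The window length, the measured values, and the fixed target speed below are hypothetical.

    # Sketch: periodically re-estimating the speech speed modification factor
    # during a communication session as the talker speeds up or slows down.

    def update_factor(measured_wpm: float, target_wpm: float) -> float:
        return target_wpm / measured_wpm

    # Hypothetical per-window measurements of the talker's original speech speed.
    windowed_wpm = [170.0, 150.0, 120.0]
    target_wpm = 140.0

    for window_index, original_wpm in enumerate(windowed_wpm):
        factor = update_factor(original_wpm, target_wpm)
        print(f"window {window_index}: {original_wpm} wpm -> factor {factor:.2f}x")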

FIG. 1 illustrates a system configured to adjust a speed of human speech playback according to embodiments of the present disclosure. Although the figures and discussion illustrate certain operational steps of the system in a particular order, the steps described may be performed in a different order (as well as certain steps removed or added) without departing from the intent of the disclosure. As shown in FIG. 1, the system may include one or more devices (e.g., PSTN telephone 20, VoIP device 30, first device 110 a, and/or second device 110 b) local to a user along with one or more servers 120 connected across one or more networks 10. The server(s) 120 (which may be one or more different physical devices) may be capable of performing speech processing (e.g., ASR and NLU) as well as non-speech processing operations as described herein. A single server 120 may perform all speech processing or multiple servers 120 may combine to perform all speech processing.

As shown in FIG. 2, a device 110 may receive audio 11 including a spoken utterance of a user via microphone(s) 112 (or array of microphones) of the device 110. The device 110 generates audio data 211 corresponding to the audio 11, and sends the audio data 211 to the server(s) 120 for processing. Additionally or alternatively, the device 110 may receive text input by the user via either a physical keyboard or virtual keyboard presented on a touch sensitive display of the device 110. The device 110 generates input text data corresponding to the text, and sends the input text data to the server(s) 120 for processing.

The server(s) 120 receives input data from the device 110. If the input data is the audio data 211, the server(s) 120 performs speech recognition processing (e.g., ASR) on the audio data 211 to generate input text data. The server(s) 120 performs natural language processing (e.g., NLU) on the input text data (either received directly from the device 110 or generated from the audio data 211) to determine a user command. A user command may correspond to a user request for the system to output content to the user. The requested content to be output may correspond to music, video, search results, weather information, etc.

The server(s) 120 determines output content responsive to the user command. The output content may be received from a first party (1P) source (e.g., one controlled or managed by the server(s) 120) or a third party (3P) source (e.g., one managed by an application server(s) (not illustrated) in communication with the server(s) 120 but not controlled or managed by the server(s) 120). The server(s) 120 sends to a device 110 output data including the output content responsive to the user command. The device 110 may emit the output data as audio and/or present the output data on a display.

The system may operate using various components as illustrated in and described with respect to FIG. 2. The various components illustrated in FIG. 2 may be located on a same or different physical device. Communication between various components illustrated in FIG. 2 may occur directly or across a network(s) 10.

An audio capture component, such as a microphone or array of microphones of a device 110, captures the input audio 11 corresponding to a spoken utterance. The device 110, using a wakeword detection component 220, processes audio data corresponding to the input audio 11 to determine if a keyword (e.g., a wakeword) is detected in the audio data. Following detection of a wakeword, the device 110 sends audio data 211, corresponding to the utterance, to a server(s) 120 for processing.

Upon receipt by the server(s) 120, the audio data 211 may be sent to an orchestrator component 230. The orchestrator component 230 may include memory and logic that enable the orchestrator component 230 to transmit various pieces and forms of data to various components of the system.

The orchestrator component 230 sends the audio data 211 to a speech processing component 240. A speech recognition component 250 of the speech processing component 240 transcribes the audio data 211 into text data representing words of speech contained in the audio data 211. The speech recognition component 250 interprets the spoken utterance based on a similarity between the spoken utterance and pre-established language models. For example, the speech recognition component 250 may compare the audio data 211 with models for sounds (e.g., subword units or phonemes) and sequences of sounds to identify words that match the sequence of sounds spoken in the utterance of the audio data 211.

Results of speech recognition processing (i.e., text data representing speech) are processed by a natural language component 260 of the speech processing component 240. The natural language component 260 attempts to make a semantic interpretation of the text data. That is, the natural language component 260 determines the meaning behind the text data based on the individual words in the text data and then implements that meaning. The natural language component 260 interprets a text string to derive an intent or a desired action from the user as well as the pertinent pieces of information in the text data that allow a device (e.g., the device 110, the server(s) 120, the application server(s), etc.) to complete that action. For example, if a spoken utterance is processed using the speech recognition component 250, which outputs the text data “call mom”, the natural language component 260 may determine the user intended to activate a telephone in his/her device and to initiate a call with a contact matching the entity “mom.”

The natural language component 260 may be configured to determine a “domain” of the utterance so as to determine and narrow down which services offered by an endpoint device (e.g., the server(s) 120 or the device 110) may be relevant. For example, an endpoint device may offer services relating to interactions with a telephone service, a contact list service, a calendar/scheduling service, a music player service, etc. Words in a single textual interpretation may implicate more than one service, and some services may be functionally linked (e.g., both a telephone service and a calendar service may utilize data from a contact list).

The natural language component 260 may include a recognizer that includes a named entity resolution (NER) component configured to parse and tag to annotate text as part of natural language processing. For example, for the text “call mom,” “call” may be tagged as a command to execute a phone call and “mom” may be tagged as a specific entity and target of the command. Moreover, the telephone number for the entity corresponding to “mom” stored in a contact list may be included in the NLU results. Further, the natural language component 260 may be used to provide answer data in response to queries, for example using a natural language knowledge base.

In natural language processing, a domain may represent a discrete set of activities having a common theme, such as “shopping,” “music,” “calendaring,” “communications,” etc. As such, each domain may be associated with a particular recognizer, language model and/or grammar database, a particular set of intents/actions, and a particular personalized lexicon. Each gazetteer may include domain-indexed lexical information associated with a particular user and/or device. A user's music-domain lexical information (e.g., a gazetteer associated with the user for a music domain) might correspond to album titles, artist names, and song names, for example, whereas a user's contact-list lexical information (e.g., a gazetteer associated with the user for a contact domain) might include the names of contacts. Since every user's music collection and contact list is presumably different, this personalized information improves entity resolution. A lexicon may represent what particular data for a domain is associated with a particular user. The form of the lexicon for a particular domain may be a data structure, such as a gazetteer. A gazetteer may be represented as a vector with many bit values, where each bit indicates whether a data point associated with the bit is associated with a particular user. For example, a music gazetteer may include one or more long vectors, each representing a particular group of musical items (such as albums, songs, artists, etc.) where the vector includes positive bit values for musical items that belong in the user's approved music list. Thus, for a song gazetteer, each bit may be associated with a particular song, and for a particular user's song gazetteer the bit value may be 1 if the song is in the particular user's music list. Other data structure forms for gazetteers or other lexicons are also possible.
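
The bit-vector form of a gazetteer described above can be illustrated with a tiny example. The song catalog and the user's music list below are hypothetical.

    # Sketch: a song gazetteer as a bit vector, where each position corresponds
    # to a particular song and a 1 means the song is in the user's music list.

    song_catalog = ["mother's little helper", "paint it black", "angie", "satisfaction"]
    user_music_list = {"mother's little helper", "satisfaction"}

    gazetteer_bits = [1 if song in user_music_list else 0 for song in song_catalog]
    print(gazetteer_bits)  # [1, 0, 0, 1]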

As noted above, in traditional natural language processing, text data may be processed applying the rules, models, and information applicable to each identified domain. For example, if text represented in text data potentially implicates both communications and music, the text data may, substantially in parallel, be natural language processed using the grammar models and lexical information for communications, and natural language processed using the grammar models and lexical information for music. The responses based on the text data produced by each set of models are scored, with the overall highest ranked result from all applied domains being ordinarily selected to be the correct result.

A downstream process called named entity resolution actually links a text portion to an actual specific entity known to the system. To perform named entity resolution, the system may utilize gazetteer information stored in an entity library storage. The gazetteer information may be used for entity resolution, for example matching speech recognition results with different entities (e.g., song titles, contact names, etc.). Gazetteers may be linked to users (e.g., a particular gazetteer may be associated with a specific user's music collection), may be linked to certain domains (e.g., shopping, music, communications), or may be organized in a variety of other ways. The NER component may also determine whether a word refers to an entity that is not explicitly mentioned in the text data, for example “him,” “her,” “it” or other anaphora, exophora or the like.

A recognizer of the natural language component 260 may also include an intent classification (IC) component that processes text data to determine an intent(s), where the intent(s) corresponds to the action to be performed that is responsive to the user command represented in the text data. Each recognizer is associated with a database of words linked to intents. For example, a music intent database may link words and phrases such as “quiet,” “volume off,” and “mute” to a “mute” intent. The IC component identifies potential intents by comparing words in the text data to the words and phrases in the intents database. Traditionally, the IC component determines the intent using a set of rules or templates that are processed against the incoming text data to identify a matching intent.

In order to generate a particular interpreted response, the NER component applies the grammar models and lexical information associated with the respective recognizer to recognize a mention of one or more entities in the text represented in the text data. In this manner the NER component identifies “slots” (i.e., particular words in text data) that may be needed for later command processing. Depending on the complexity of the NER component, it may also label each slot with a type (e.g., noun, place, city, artist name, song name, or the like). Each grammar model includes the names of entities (i.e., nouns) commonly found in speech about the particular domain (i.e., generic terms), whereas the lexical information from the gazetteer is personalized to the user(s) and/or the device. For instance, a grammar model associated with the shopping domain may include a database of words commonly used when people discuss shopping.

The intents identified by the IC component are linked to domain-specific grammar frameworks with “slots” or “fields” to be filled. Each slot/field corresponds to a portion of the text data that the system believes corresponds to an entity. For example, if “play music” is an identified intent, a grammar framework(s) may correspond to sentence structures such as “Play {Artist Name},” “Play {Album Name},” “Play {Song name},” “Play {Song name} by {Artist Name},” etc. However, to make resolution more flexible, these frameworks would ordinarily not be structured as sentences, but rather based on associating slots with grammatical tags.

For example, the NER component may parse the text data to identify words as subject, object, verb, preposition, etc., based on grammar rules and/or models, prior to recognizing named entities. The identified verb may be used by the IC component to identify intent, which is then used by the NER component to identify frameworks. A framework for an intent of “play” may specify a list of slots/fields applicable to play the identified “object” and any object modifier (e.g., a prepositional phrase), such as {Artist Name}, {Album Name}, {Song name}, etc. The NER component then searches the corresponding fields in the domain-specific and personalized lexicon(s), attempting to match words and phrases in the text data tagged as a grammatical object or object modifier with those identified in the database(s).

To illustrate an example, a command of “book me a plane ticket from Boston to Seattle for July 5” may be associated with a <BookPlaneTicket> intent. The <BookPlaneTicket> intent may be associated with a framework including various slots including, for example, <DepartureDate>, <DepartureLocation>, <ArrivalDate>, and <DestinationLocation>. In the above example, the server(s) 120, namely the natural language component 260, may populate the framework as follows: <DepartureDate: July 5>, <DepartureLocation: Boston>, <ArrivalDate: July 5>, and <DestinationLocation: Seattle>.
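
The slot-filling result in the example above can be pictured as a simple mapping from slot names to values. The dictionary layout below is only an illustration of the populated <BookPlaneTicket> framework, not a prescribed data structure.

    # Sketch: a <BookPlaneTicket> framework with empty slots, populated for the
    # utterance "book me a plane ticket from Boston to Seattle for July 5".

    framework = {
        "DepartureDate": None,
        "DepartureLocation": None,
        "ArrivalDate": None,
        "DestinationLocation": None,
    }

    populated = dict(framework)
    populated.update({
        "DepartureDate": "July 5",
        "DepartureLocation": "Boston",
        "ArrivalDate": "July 5",
        "DestinationLocation": "Seattle",
    })
    print(populated)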

This process includes semantic tagging, which is the labeling of a word or combination of words according to their type/semantic meaning. Parsing may be performed using heuristic grammar rules, or the NER component may be constructed using techniques such as HMMs, maximum entropy models, log linear models, conditional random fields (CRF), and the like.

For instance, a query of “play mother's little helper by the rolling stones” might be parsed and tagged as {Verb}: “Play,” {Object}: “mother's little helper,” {Object Preposition}: “by,” and {Object Modifier}: “the rolling stones.” At this point in the process, “Play” is identified as a verb based on a word database associated with the music domain, which the IC component will determine corresponds to the “play music” intent. At this stage, no determination has been made as to the meaning of “mother's little helper” and “the rolling stones,” but based on grammar rules and models, it is determined that the text of these phrases relates to the grammatical object (i.e., entity) of the text data.

The frameworks linked to the intent are then used to determine what database fields should be searched to determine the meaning of these phrases, such as searching a user's gazetteer for similarity with the framework slots. So a framework for “play music intent” might indicate to attempt to resolve the identified object based on {Artist Name}, {Album Name}, and {Song name}, and another framework for the same intent might indicate to attempt to resolve the object modifier based on {Artist Name}, and resolve the object based on {Album Name} and {Song Name} linked to the identified {Artist Name}. If the search of the gazetteer does not resolve the slot/field using gazetteer information, the NER component may search a database of generic words associated with the domain. For example, if the text data corresponds to “play songs by the rolling stones,” after failing to determine an album name or song name called “songs” by “the rolling stones,” the NER component may search the domain vocabulary for the word “songs.” In the alternative, generic words may be checked before the gazetteer information, or both may be tried, potentially producing two different results.

The results of natural language processing may be tagged to attribute meaning to the text data. So, for instance, “play mother's little helper by the rolling stones” might produce a result of: {domain} Music, {intent} Play Music, {artist name} “rolling stones,” {media type} SONG, and {song title} “mother's little helper.” As another example, “play songs by the rolling stones” might produce: {domain} Music, {intent} Play Music, {artist name} “rolling stones,” and {media type} SONG.

The results of natural language processing may be sent to an application 290, which may be located on a same or separate server 120 as part of the system. The system may include more than one application 290, and the destination application 290 may be determined based on the natural language processing results. For example, if the natural language processing results include a command to play music, the destination application 290 may be a music playing application, such as one located on the device 110 or in a music playing appliance, configured to execute a music playing command. If the natural language processing results include a search request (e.g., requesting the return of search results), the application 290 selected may include a search engine application, such as one located on a search server, configured to execute a search command and determine search results, which may include output text data to be processed by a text-to-speech engine and output from a device as synthesized speech.

The server(s) 120 may include a user recognition component 295. The user recognition component 295 may take as input the audio data 211 as well as the text data output by the speech recognition component 250. The user recognition component 295 may receive the text data from the speech recognition component 250 either directly or indirectly via the orchestrator component 230. Alternatively, the user recognition component 295 may be implemented as part of the speech recognition component 250. The user recognition component 295 determines respective scores indicating whether the utterance in the audio data 211 was spoken by particular users. The user recognition component 295 also determines an overall confidence regarding the accuracy of user recognition operations. User recognition may involve comparing speech characteristics in the audio data 211 to stored speech characteristics of users. User recognition may also involve comparing biometric data (e.g., fingerprint data, iris data, etc.) received by the user recognition component 295 to stored biometric data of users. User recognition may further involve comparing image data including a representation of at least a feature of a user with stored image data including representations of features of users. It should be appreciated that other kinds of user recognition processes, including those known in the art, may be used. Output of the user recognition component 295 may be used to inform natural language processing as well as processing performed by 1P and 3P applications 290.

The server(s) 120 may additionally include user profile storage 270. The user profile storage 270 includes data regarding user accounts. As illustrated, the user profile storage 270 is implemented as part of the server(s) 120. However, it should be appreciated that the user profile storage 270 may be located proximate to the server(s) 120, or may otherwise be in communication with the server(s) 120, for example over the network(s) 10. The user profile storage 270 may include a variety of information related to individual users, accounts, etc. that interact with the system.

FIG. 2 illustrates various 1P applications 290 of the system. However, it should be appreciated that the data sent to the 1P applications 290 may also be sent to 3P application servers executing 3P applications.

In some examples, an application 290 may correspond to a communications application configured to control communications (e.g., a communication session, including audio data and/or image data), voice messages (e.g., play back one or more voice messages stored in voicemail or the like), or the like. The communications application may be configured to perform normalization (e.g., power normalization, volume normalization or the like) and/or adjust a speech speed of audio data associated with the communication session (e.g., in real-time during the communication session) or the voice messages (e.g., offline, prior to playback of the voice messages, and/or in real-time during playback of the voice messages). For example, the communication application may modify the speech speed of a voice message at a first time and, after the user requests playback of the voice message, may output the modified voice message at a second time. In some examples, the communication application may begin outputting the modified voice message at the second time and may adjust the speech speed and generate a second modified voice message during playback (e.g., in response to a user command). Additionally or alternatively, the communication application may begin outputting the original voice message when the user requests playback of the voice message and may modify the speech speed of the voice message during playback (e.g., in response to a user command).

Application, as used herein, may be considered synonymous with a skill.A “skill” may correspond to a domain and may be software running on aserver(s) 120 and akin to an application. That is, a skill may enable aserver(s) 120 or application server(s) to execute specific functionalityin order to provide data or produce some other output called for by auser. The system may be configured with more than one skill. For examplea weather service skill may enable the server(s) 120 to execute acommand with respect to a weather service server(s), a car service skillmay enable the server(s) 120 to execute a command with respect to a taxiservice server(s), an order pizza skill may enable the server(s) 120 toexecute a command with respect to a restaurant server(s), etc.

Output of the application/skill 290 may be in the form of text data to be conveyed to a user. As such, the application/skill output text data may be sent to a text-to-speech (TTS) component 280 either directly or indirectly via the orchestrator component 230. The TTS component 280 may synthesize speech corresponding to the received text data. Speech audio data synthesized by the TTS component 280 may be sent to a device 110 for output to a user.

The TTS component 280 may perform speech synthesis using one or moredifferent methods. In one method of synthesis called unit selection, theTTS component 280 matches the text data or a derivative thereof againsta database of recorded speech. Matching units are selected andconcatenated together to form speech audio data. In another method ofsynthesis called parametric synthesis, the TTS component 280 variesparameters such as frequency, volume, and noise to create an artificialspeech waveform output. Parametric synthesis uses a computerized voicegenerator, sometimes called a vocoder.

FIG. 3A is a conceptual diagram of how adjusting a speed of human speechplayback is performed according to examples of the present disclosure.As illustrated in FIG. 3A, input audio 11 may be captured by aspeech-controlled device 110 (e.g., second device 110 b) as commandaudio data 13 and the command audio data 13 may be sent to the server(s)120. The server(s) 120 may process the command audio data 13 anddetermine that the command audio data 13 corresponds to a command toplay voice messages stored on the server(s) 120. The server(s) 120 mayidentify message audio data 15 corresponding to a voice message and mayperform voice normalization and/or adjust a speed of human speechplayback on the message audio data 15. To perform voice normalizationand/or adjust a speech speed, the server(s) 120 may receive input data320 from the speech-controlled device 110 and/or additional devices.Additionally or alternatively, the server(s) 120 may analyze the commandaudio data 13 to generate command speech data 340 and/or analyze themessage audio data 15 to generate message speech data 360.

As illustrated in FIG. 3A, in some examples the server(s) 120 mayreceive a voice command (e.g., command audio data 13) instructing theserver(s) 120 to playback the voice message and the server(s) 120 mayanalyze the command audio data 13 to determine the command and togenerate the command speech data 340, which may be used to determine thetarget speech speed. However, the disclosure is not limited thereto andthe server(s) 120 may receive a command that is not associated withcommand audio data 13 without departing from the disclosure. Forexample, the second device 110 b may receive input on a touchscreendisplay that corresponds to the command and may send the command to theserver(s) 120 to instruct the server(s) 120 to begin playback of one ormore voice messages. Thus, the server(s) 120 may perform voicenormalization and/or adjust a speech speed without receiving the commandaudio data 13 and/or generating the command speech data 340.

As discussed above, the server(s) 120 may perform voice normalization and/or adjust a speech speed by determining (310) an original speech speed associated with speech represented in the message audio data 15 (e.g., input audio data), determining (312) a target speech speed, determining (314) a speech speed modification factor and generating (316) output audio data based on the message audio data 15 and the speech speed modification factor. The server(s) 120 may determine the target speech speed dynamically for portions of the message audio data 15 based on the input data 320, the command speech data 340 and/or the message speech data 360.
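For illustration only, the following Python sketch strings steps 310-316 together under simplifying assumptions: the word-count based speed estimate, the choice of target speed, and the `time_stretch` helper (a naive resample that does not preserve pitch) are stand-ins, not the processing actually claimed.

```python
import numpy as np


def words_per_minute(word_count: int, duration_s: float) -> float:
    """Estimate a speech speed from a transcript word count and clip length."""
    return 60.0 * word_count / duration_s


def speed_factor(original_wpm: float, target_wpm: float) -> float:
    """Speech speed modification factor: values below 1 slow playback down."""
    return target_wpm / original_wpm


def time_stretch(samples: np.ndarray, factor: float) -> np.ndarray:
    """Placeholder time-scale change; a real system would preserve pitch."""
    positions = np.arange(0, len(samples), factor)
    return np.interp(positions, np.arange(len(samples)), samples)


# Steps 310-316 with illustrative numbers only.
original = words_per_minute(word_count=250, duration_s=100.0)   # step 310: 150 wpm
target = 120.0                                  # step 312: from input/command/message data
factor = speed_factor(original, target)         # step 314: 0.8, i.e., slow down
message_audio = np.zeros(16_000)                # stand-in for message audio data 15
output_audio = time_stretch(message_audio, factor)  # step 316
```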

As illustrated in FIGS. 3A-3B, the input data 320 may include a varietyof information, such as user preferences 322 (e.g., previous settingsassociated with the user, such as a preferred target speech speed (e.g.,playback speed), target speech speeds associated with differentidentities associated with the message audio data 15, etc.), calendarentries 324 (e.g., calendar data associated with upcoming or recentmeeting information or the like), location information of a listener 326(e.g., location data indicating current location of the second device110 b during playback, such as GPS coordinates or the like), presenceinformation of the listener 328 (e.g., whether human presence isdetected by the second device 110 b), a number of voice messages 330(e.g., a total number of voice messages), explicit commands to changespeech speed 332 (e.g., a command input to a companion device, aprevious request of “Alexa, increase speech speed,” or the like), mediainformation 334 (e.g., information about content being viewed by thelistener, received from the second device 110 b and/or a companiondevice associated with the account), typing detected notification 336(e.g., notification that the listener is typing received from the seconddevice 110 b or a companion device associated with the account) and/orother data 338 (e.g., any other non-audio data associated with thecommand to playback the voice message at the time that the command isreceived and/or during playback).

An example of other data 338 may include identity information associated with a listener. For example, a companion device (e.g., smart phone 110 a) may be associated with a particular user profile, and if the companion device is in proximity to the second device 110 b when the second device 110 b receives input audio data, the system 100 may include an identity associated with the user profile in the input data 320. Alternatively, the companion device may receive the command and generate the input audio data, in which case the companion device may include the identity associated with the companion device in the input data 320. Similarly, the second device 110 b itself may be associated with an identity, and that identity may be included in the input data 320. Additionally or alternatively, other techniques may be used to determine the identity of the user speaking the command (e.g., the listener). For example, the identity of the listener may be determined using facial recognition or the like, and the server(s) 120 may receive an indication of the identity of the listener as part of the input data 320.

As illustrated in FIG. 3A and FIG. 3C, the server(s) 120 may analyze thecommand audio data 13 to generate command speech data 340, which mayinclude a variety of information such as a command speech speed 342(e.g., detected speech speed of speech represented in the command audiodata 13), speech urgency data (e.g., determined based on content of thecommand audio data 13), an identity of the listener 346 (e.g., identityof the user associated with the command to playback voice messages,which may be determined based on voice recognition, facial recognition,data received from a companion device or the like), typing detectednotification 348 (e.g., sounds associated with a keyboard and/or typingdetected in the command audio data 13), explicit commands to changespeech speed 350 (e.g., “Alexa, increase speech speed”),conversation/interruption 352 (e.g., detecting that conversation orother interruption occurs during playback of the voice messages), and/orother data 354. If the identity of the listener is not determined basedon the command audio data 13 (e.g., not determined using voicerecognition or the like), the identity of the listener may instead beassociated with the input data 320. For example, if the identity of thelistener is determined using facial recognition or based on a companiondevice (e.g., smartphone) associated with a user, the server(s) 120 mayreceive an indication of the identity of the listener as part of theinput data 320 without departing from the disclosure.

In some examples, the command audio data 13 corresponds to audio datareceived that instructs the server(s) 120 to perform playback of thevoice messages (e.g., audio data prior to playback). However, thedisclosure is not limited thereto and the command audio data maycorrespond to audio data received during playback of the voice messages.Thus, the typing detected notification 348, explicit commands to changespeech speed 350, and/or the conversation/interruption 352 maycorrespond to when the voice message is being played back by the seconddevice 110 b without departing from the disclosure.

As used herein, the input data 320 and the command speech data 340 maybe collectively referred to as configuration data. Thus, configurationdata corresponds to non-message data used to determine the target speechspeed. For example, the configuration data may correspond to informationabout the listener (e.g., user preferences, calendar entries, locationinformation, identity, or the like that is included in a user profileassociated with the listener), contextual information (e.g., a number ofvoice messages, previous commands, media information, typing detectednotification, etc.), the command (e.g., command speech speed, speechurgency data, audio cues included in the command audio data 13, etc.),or the like.

As illustrated in FIG. 3A and FIG. 3D, the server(s) 120 may analyze themessage audio data 15 to generate message speech data 360, which mayinclude a variety of information such as a message speech speed 362(e.g., original speech speed(s) for speech represented in the messageaudio data 15), a background noise level 364 (e.g., background noisepower associated with the message audio data 15), a signal to noiseratio (SNR) 366 associated with the message audio data 15, an errorrate/confidence score 368 associated with portions of the message audiodata 15 after performing speech recognition or the like, an identity ofa speaker 370 (e.g., identity of a user associated with speechrepresented in the message audio data 15), multiple speakers detected372 (e.g., an indication that speech is detected from multiple differentpeople), numbers detected 374 (e.g., detected sounds that may correspondto numbers, such as a sequence of numbers corresponding to a phonenumber), an indication that an accent is detected 376 (e.g., detectingthat the speech is associated with a foreign accent is therefore moredifficult to understand), and/or other data 378.
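The message speech data 360 may be represented in many ways; the dataclass below is merely one illustrative container whose field names and types are assumptions rather than the system's actual data model.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MessageSpeechData:
    """Illustrative container for the fields of message speech data 360."""
    message_speech_speed_wpm: float            # 362: original speech speed(s)
    background_noise_level_db: float           # 364: background noise power
    snr_db: float                              # 366: signal-to-noise ratio
    asr_confidence: float                      # 368: error rate / confidence score
    speaker_identity: Optional[str] = None     # 370: identity of the speaker
    multiple_speakers: bool = False            # 372: speech from several people
    numbers_detected: bool = False             # 374: phone-number-like digit runs
    accent_detected: bool = False              # 376: accented speech flagged
    other: dict = field(default_factory=dict)  # 378: age, detected entities, etc.
```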

An example of information included in other data 378 is an ageassociated with the speaker. For example, a young child may be moredifficult to understand and therefore the server(s) 120 may slow aspeech speed to improve performance. In some examples, the age may beincluded in the identity of the speaker 370, although the disclosure isnot limited thereto. The other data 378 may also include additionalinformation that the server(s) 120 determine to be important, such asnames, times, dates, phone numbers, addresses or the like. For example,the server(s) 120 may perform automatic speech recognition and/ornatural language understanding to identify entities and may determinethat the entities are likely to be significant. Thus, the server(s) 120may analyze content of the voice message, detect portions that arelikely to be important and adjust the target speech speed to improveplayback of the voice message.
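As a hedged sketch of how detected entities might lower the target speech speed for only the spans that carry them, the helper below takes entity spans (which a real system would derive from speech recognition and natural language understanding) and emits per-segment targets; the span format and threshold values are assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class EntitySpan:
    start_s: float
    end_s: float
    label: str  # e.g., "PHONE_NUMBER", "NAME", "ADDRESS"


def entity_aware_targets(duration_s: float,
                         entities: List[EntitySpan],
                         base_wpm: float = 150.0,
                         important_wpm: float = 110.0) -> List[Tuple[float, float, float]]:
    """Return (start_s, end_s, target_wpm) segments, slowing spans with entities.

    Assumes the entity spans are non-overlapping; the target values are
    illustrative only.
    """
    segments = []
    cursor = 0.0
    for span in sorted(entities, key=lambda e: e.start_s):
        if span.start_s > cursor:
            segments.append((cursor, span.start_s, base_wpm))
        segments.append((span.start_s, span.end_s, important_wpm))
        cursor = span.end_s
    if cursor < duration_s:
        segments.append((cursor, duration_s, base_wpm))
    return segments
```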

The message speech speed 362 may determine an overall speech speedassociated with the voice message, such as an average speech speed overan entirety of the voice message. However, the disclosure is not limitedthereto, and the message speech speed 362 may include more complex datasuch as an average speech speed for different portions of the voicemessage. For example, the message speech speed 362 may indicate anaverage speech speed for a fixed duration of time (e.g., 2 seconds, 5seconds, etc.). Additionally or alternatively, the server(s) 120 mayidentify portions of the voice message associated with similar speechspeeds (e.g., a range of speech speeds) and the message speech speed 362may indicate the portions and an average speech speed for each portion.For example, the voice message may start at a first speech speed for 5seconds and then slow to a second speech speed for 15 seconds. Thus, themessage speech speed 362 included in the message speech data 360 mayindicate that the first portion corresponds to the first 5 seconds(e.g., begin time=0 s, end time=5 s) and has the first speech speed(e.g., speech speed=150 words per minute) and that the second portioncorresponds to the next 15 seconds (e.g., begin time=5 s, end time=20 s)and has the second speech speed (e.g., speech speed=100 words perminute). Thus, the original speech speed may vary over time and theserver(s) 120 may track variations in the original speech speed andmodify a speech speed modification factor to match the target speechspeed.
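One possible encoding of these per-portion speech speeds, together with the per-portion modification factors obtained by dividing the target speed by each portion's original speed, is sketched below using the example values from the preceding paragraph.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SpeedPortion:
    begin_s: float
    end_s: float
    original_wpm: float


def portion_factors(portions: List[SpeedPortion], target_wpm: float) -> List[float]:
    """Per-portion speech speed modification factors (target divided by original)."""
    return [target_wpm / p.original_wpm for p in portions]


# Example from the text: 0-5 s at 150 wpm, then 5-20 s at 100 wpm.
portions = [SpeedPortion(0.0, 5.0, 150.0), SpeedPortion(5.0, 20.0, 100.0)]
print(portion_factors(portions, target_wpm=120.0))  # [0.8, 1.2]
```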

While not illustrated in FIG. 3A and/or FIG. 3D, the server(s) 120 mayoptionally perform voice activity detection (VAD) to detect voiceactivity (e.g., speech) in the command audio data 13 and/or the messageaudio data 15 without departing from the disclosure. The server(s) 120may perform VAD using techniques known to one of skill in the art andperforming the VAD may reduce a processing load on the server(s) 120 asthe server(s) 120 may only perform voice normalization and/or adjust aspeech speed for portions of the command audio data 13 and/or themessage audio data 15 that correspond to speech. VAD techniques maydetermine whether speech is present in a particular section of audiodata based on various quantitative aspects of the audio data, such asthe spectral slope between one or more frames of the audio data; theenergy levels of the audio data in one or more spectral bands; thesignal-to-noise ratios of the audio data in one or more spectral bands;or other quantitative aspects. In other embodiments, the server(s) 120may implement a limited classifier configured to distinguish speech frombackground noise. The classifier may be implemented by techniques suchas linear classifiers, support vector machines, and decision trees. Instill other embodiments, Hidden Markov Model (HMM) or Gaussian MixtureModel (GMM) techniques may be applied to compare the audio data to oneor more acoustic models. The acoustic models may include modelscorresponding to speech, noise (such as environmental noise orbackground noise), or silence. Still other techniques may be used todetermine whether speech is present in the audio data.
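Of the VAD techniques listed above, the simplest is a frame-energy threshold; the snippet below shows only that variant, with an arbitrary frame size and threshold, and is not intended to represent the classifier- or model-based approaches.

```python
import numpy as np


def energy_vad(samples: np.ndarray,
               sample_rate: int = 16_000,
               frame_ms: int = 20,
               threshold_db: float = -35.0) -> np.ndarray:
    """Flag, per frame, whether the frame likely contains speech.

    A bare-bones energy detector; assumes samples are floats in [-1, 1].
    Spectral-slope, per-band SNR, classifier, or HMM/GMM approaches would
    replace the threshold test below.
    """
    frame_len = sample_rate * frame_ms // 1000
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    energy_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    return energy_db > threshold_db
```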

FIGS. 4A-4B are flowcharts conceptually illustrating example methods foradjusting a speed of human speech playback according to examples of thepresent disclosure. As illustrated in FIG. 4A, the server(s) 120 mayreceive (410) a command to play a voice message, may receive (412) inputaudio data and may receive (414) input data. As discussed above, theinput data may correspond to non-audio data associated with the account,the device (e.g., second device 110 b) and/or the user without departingfrom the disclosure.

The server(s) 120 may optionally determine (416) individual speech frommultiple users (e.g., separate the input audio data into differentsegments of speech, with each segment of speech corresponding to aunique user/speaker). The server(s) 120 may also optionally generate(418) command speech data associated with command audio data 13, such aswhen the command is a voice command. However, the disclosure is notlimited thereto and even when the command to play the voice message isnot a voice command, the server(s) 120 may receive command audio data 13from the second device 110 b (e.g., during playback of the voicemessage) and may generate the command speech data based on the commandaudio data 13 without departing from the disclosure.

The server(s) 120 may generate (420) message speech data associated withthe input audio data, and, either as part of generating the messagespeech data or as a separate step, may determine (422) an originalspeech speed associated with the input audio data. In some examples, theserver(s) 120 may determine different original speech speeds associatedwith different portions of the input audio data, such as when the userspeeds up or slows down while leaving the voice message. The server(s)120 may determine (424) a target speech speed, as discussed in greaterdetail above with regard to FIG. 1, and may determine (426) a speechspeed modification factor based on the target speech speed and theoriginal speech speed. For example, the server(s) 120 may divide thetarget speech speed by the original speech speed to determine the speechspeed modification factor.

The server(s) 120 may optionally determine (428) to apply the speechspeed modification factor to a portion of the input audio data. Forexample, the server(s) 120 may apply different speech speed modificationfactors to different portions of the input audio data without departingfrom the disclosure. Additionally or alternatively, the server(s) 120may optionally determine (430) variations in the speech speedmodification factor to reduce a distortion of the output audio data. Forexample, instead of abruptly changing from a first speech speed to asecond speech speed, the server(s) 120 may incrementally transition fromthe first speech speed to the second speech speed using a maximumtransition value to avoid abrupt changes in the output audio data thatcould result in distortion. In some examples, the server(s) 120 maydetermine a number of increments by dividing a change in speech speed(e.g., difference between the first speech speed factor and the secondspeech speed factor) by the maximum transition value, with the number ofincrements indicating a number of discrete speech speed modificationfactors over which to transition from a first speech speed factor to asecond speech speed factor using the maximum transition value.

The server(s) 120 may optionally determine (432) a volume modificationfactor and/or determine (434) to insert additional pauses. The volumemodification factor may increase a volume of the output audio data andthe server(s) 120 may determine the volume modification factor based onthe target speech speed associated with the output audio data. Forexample, the server(s) 120 may identify that a portion of the inputaudio data corresponds to a decreased target speech speed and mayincrease the volume modification factor for the portion of the inputaudio data, as discussed in greater detail below with regard to FIGS.8A-8B. Thus, the server(s) 120 may simultaneously slow down the targetspeech speed while increasing a volume level to further improve playbackof the voice message and enable the listener to understand speechrepresented in the output audio data.

The server(s) 120 may generate (436) output audio data based on the input audio data and the speech speed modification factor(s) determined in step 426. In some examples, the server(s) 120 may determine a single speech speed modification factor associated with the input audio data and may generate the output audio data by applying the speech speed modification factor to an entirety of the input audio data. However, the disclosure is not limited thereto and the server(s) 120 may instead determine a plurality of speech speed modification factors and may generate the output audio data by applying the plurality of speech speed modification factors to corresponding portions of the input audio data without departing from the disclosure.
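A minimal sketch of step 436 under the multi-factor case is shown below; the `naive_time_stretch` helper again changes pitch and merely stands in for whatever pitch-preserving time-scale modification the system would actually apply to each portion.

```python
from typing import List, Tuple

import numpy as np


def naive_time_stretch(samples: np.ndarray, factor: float) -> np.ndarray:
    """Placeholder time-scale change; a real system would preserve pitch."""
    positions = np.arange(0, len(samples), factor)
    return np.interp(positions, np.arange(len(samples)), samples)


def apply_factors(samples: np.ndarray,
                  sample_rate: int,
                  portions: List[Tuple[float, float, float]]) -> np.ndarray:
    """Apply a different speech speed modification factor to each portion.

    `portions` holds (begin_s, end_s, factor) tuples that cover the clip in
    order; the stretched pieces are concatenated to form the output audio.
    """
    pieces = []
    for begin_s, end_s, factor in portions:
        begin, end = int(begin_s * sample_rate), int(end_s * sample_rate)
        pieces.append(naive_time_stretch(samples[begin:end], factor))
    return np.concatenate(pieces)
```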

FIG. 4B is similar to FIG. 4A but is intended to illustrate how the server(s) 120 dynamically adjust target speech speeds and/or speech speed modification factors for different portions of the input audio data.

As illustrated in FIG. 4B, the server(s) 120 may determine (450)individual speech from multiple users (e.g., separate the input audiodata into different segments of speech, with each segment of speechcorresponding to a unique user/speaker). The server(s) 120 may select(452) speech associated with a user (e.g., first speech associated witha first user) and may determine (454) a portion of the input audio dataassociated with the user. The server(s) 120 may determine (456) anoriginal speech speed for the selected portion, may determine (458) atarget speech speed for the selected portion, and may determine (460) aspeech speed modification factor for the selected portion. For example,the server(s) 120 may divide the target speech speed by the originalspeech speed to determine the speech speed modification factor.

The server(s) 120 may optionally determine (462) variations in thespeech speed modification factor for the selected portion based on themaximum transition value (e.g., to reduce distortion in the output audiodata as discussed above with regard to step 430), may optionallydetermine (464) a volume modification factor for the selected portion(e.g., to improve playback of the voice message, as discussed above withregard to step 432) and may optionally determine (466) to insertadditional pauses in the selected portion.

The server(s) 120 may determine (468) if an additional portion of theinput audio data is present, and if so, may loop to step 454 and repeatsteps 454-466 for the additional portion. If an additional portion ofthe input audio data is not present, the server(s) 120 may determine(470) if an additional user is present and, if so, may loop to step 452and repeat steps 452-468 for speech associated with the additional user.If an additional user is not determined to be represented in the inputaudio data, the server(s) 120 may generate (472) output audio data basedon the input audio data, the speech speed modification factor(s)determined in steps 460 and 462, the volume modification factor(s)determined in step 464, and/or the additional pauses inserted in step466.

While many examples described above refer to adjusting a speech speed ofa voice message, the disclosure is not limited thereto. Instead, theserver(s) 120 may adjust a speech speed in real-time during acommunication session without departing from the disclosure. Forexample, the server(s) 120 may periodically determine an original speechspeed associated with a person speaking during the communication sessionand determine a speech speed modification factor that modifies theoriginal speech speed to a target speech speed. Thus, during thecommunication session, the server(s) 120 may apply a current speechspeed modification factor to audio data associated with the personspeaking. Thus, the speech speed modification factor may vary over time,depending on the original speech speed of the person speaking. In someexamples, the server(s) 120 may determine individual speech speedmodification factors for each unique voice (e.g., unique identity)detected in the input audio data. For example, first speech associatedwith a first person may be adjusted using a first speech speedmodification factor while second speech associated with a second personmay be adjusted using a second speech speed modification factor.However, the disclosure is not limited thereto and the server(s) 120 mayapply a single speech speed modification factor for multiple userswithout departing from the disclosure.

FIG. 5 illustrates an example of applying different speech speedmodification factors to different portions of input audio data accordingto examples of the present disclosure. As shown in FIG. 5, a speechspeed modification chart 510 illustrates a variety of speech speedmodification factors (which may be referred to as speech speed factorswithout departing from the disclosure), such as a first speech speedfactor 512, a second speech speed factor 514 and a third speech speedfactor 516. The first speech speed factor 512 (e.g., 1×) may correspondto a neutral speech speed factor that does not modify a speech speed ofthe input audio data, and the first speech speed factor 512 may be usedanywhere that the input audio data does not need to be modified. Forexample, if the original speech speed is similar to the target speechspeed, the server(s) 120 may use the first speech speed factor 512throughout the input audio data.

In contrast, the second speech speed factor 514 corresponds to a lower speech speed factor (e.g., 0.66×), which is used to intentionally slow down a portion of the input audio data. For example, the server(s) 120 may detect that the original speech speed is too fast relative to the target speech speed for a first portion of the input audio data and may use the second speech speed factor 514 to decrease a speech speed associated with the first portion. Similarly, the third speech speed factor 516 corresponds to a higher speech speed factor (e.g., a value greater than 1×), which is used to intentionally speed up a portion of the input audio data. For example, the server(s) 120 may detect that the original speech speed is too slow relative to the target speech speed for a second portion of the input audio data and may use the third speech speed factor 516 to increase a speech speed associated with the second portion.

As illustrated in FIG. 5, the server(s) 120 may vary the speech speedmodification factor throughout the input audio data, allowing theserver(s) 120 to dynamically determine an appropriate speech speedmodification factor based on characteristics associated with the inputaudio data. For example, portions of the input audio data correspondingto information which a listener may need to write down or record (e.g.,phone number, names, etc.) may be slowed down while other portions ofthe input audio data are left as is or sped up.

FIG. 6 illustrates an example of incrementally changing speech speedmodification factors to avoid distortion according to examples of thepresent disclosure. While FIG. 5 illustrates the server(s) 120 varyingspeech speed modification factors for different portions of the inputaudio data, FIG. 6 is directed instead to the server(s) 120incrementally transitioning to a speech speed modification factor toreduce distortion in the output audio data. For example, instead ofabruptly transitioning from a first speech speed modification factor toa second speech speed modification factor, the server(s) 120 maytransition incrementally over time to avoid abrupt changes in the outputaudio data that could result in distortion.

As shown in FIG. 6, a speech speed modification chart 610 illustrateschanging from a first speech speed factor 612 to a second speech speedfactor 614. To avoid an abrupt change in speech speed, the server(s) 120may increment the speech speed modification factor slowly over time. Forexample, the server(s) 120 may determine a number of increments bydividing a change in speech speed (e.g., difference between the firstspeech speed factor 612 and the second speech speed factor 614) by amaximum transition value, with the number of increments indicating howmany speech speed modification factors with which to transition from thefirst speech speed factor 612 to the second speech speed factor 614using the maximum transition value. The server(s) 120 may apply eachspeech speed modification factor for a minimum number of audio samples(e.g., minimum duration of time) before transitioning to the next speechspeed modification factor.

To illustrate an example, if the maximum transition value is 0.1× and the difference between the first speech speed factor 612 (e.g., 1×) and the second speech speed factor 614 (e.g., 0.7×) is 0.3×, the server(s) 120 may transition from the first speech speed factor 612 to the second speech speed factor 614 using a total of three increments of 0.1× each. Thus, the server(s) 120 may transition from the first speech speed factor 612 (e.g., 1×) to a first intermediate speech speed factor 616 a (e.g., 0.9×), from the first intermediate speech speed factor 616 a to a second intermediate speech speed factor 616 b (e.g., 0.8×), and from the second intermediate speech speed factor 616 b to the second speech speed factor 614 (e.g., 0.7×).
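The increment arithmetic in this example can be expressed compactly; the helper below reproduces the 1× to 0.7× transition in three 0.1× steps and is only an illustrative calculation, not the system's scheduling logic.

```python
import math
from typing import List


def transition_factors(current: float, target: float, max_step: float = 0.1) -> List[float]:
    """Intermediate speech speed factors stepping from `current` to `target`.

    The number of increments is the difference divided by the maximum
    transition value (rounded up), so no single step exceeds `max_step`;
    each factor would be held for a minimum number of audio samples.
    """
    steps = max(1, math.ceil(abs(target - current) / max_step - 1e-9))
    return [round(current + (target - current) * i / steps, 6) for i in range(1, steps + 1)]


print(transition_factors(1.0, 0.7))  # [0.9, 0.8, 0.7], matching the example above
```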

To transition back from the second speech speed factor 614 to the first speech speed factor 612, the server(s) 120 may repeat the process and transition from the second speech speed factor 614 (e.g., 0.7×) to the second intermediate speech speed factor 616 b (e.g., 0.8×), from the second intermediate speech speed factor 616 b to the first intermediate speech speed factor 616 a (e.g., 0.9×), and from the first intermediate speech speed factor 616 a to the first speech speed factor 612 (e.g., 1×).

While FIG. 6 illustrates the server(s) 120 transitioning back from thesecond speech speed factor 614 to the first speech speed factor 612, thedisclosure is not limited thereto and the server(s) 120 may transitionfrom the second speech speed factor 614 to a third speech speed factorwithout departing from the disclosure. For example, the server(s) 120may determine a difference between the second speech speed factor 614and the third speech speed factor and may transition based on themaximum transition value, as discussed above.

FIG. 7 illustrates examples of modifying a speech speed and inserting additional pauses in output audio data according to examples of the present disclosure. As shown in FIG. 7, an input chart 710 illustrates input audio data 712 having a first duration 714. After performing voice normalization on and/or adjusting a speech speed of the input audio data 712, output chart 720 illustrates output audio data 722 having a second duration 724. As illustrated by the second duration 724, the server(s) 120 decreased a target speech speed relative to the original speech speed, such that a second speech speed associated with the output audio data 722 is slower than a first speech speed associated with the input audio data 712.

In addition to modifying the second speech speed associated with the output audio data, the server(s) 120 may also insert additional pauses in the input audio data 712. For example, output chart 730 illustrates output audio data 732 having a third duration 734, which is caused by inserting pauses 736. Thus, a third speech speed associated with the output audio data 732 is identical to the second speech speed associated with the output audio data 722, but the additional pauses 736 increase the third duration 734 relative to the second duration 724. The additional pauses 736 may provide a listener with additional time to understand and/or write down information included in the output audio data 732.
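Inserting pauses amounts to splicing spans of silence into the samples at chosen positions; the sketch below assumes the positions are already known (for example, at sentence boundaries) and uses an arbitrary pause length.

```python
from typing import List

import numpy as np


def insert_pauses(samples: np.ndarray,
                  sample_rate: int,
                  positions_s: List[float],
                  pause_s: float = 0.5) -> np.ndarray:
    """Insert `pause_s` seconds of silence at each listed position (in seconds).

    Positions are processed from the end of the clip first so that earlier
    insertions do not shift the sample indices of later ones.
    """
    silence = np.zeros(int(pause_s * sample_rate), dtype=samples.dtype)
    out = samples
    for pos_s in sorted(positions_s, reverse=True):
        i = int(pos_s * sample_rate)
        out = np.concatenate([out[:i], silence, out[i:]])
    return out
```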

FIGS. 8A-8B illustrate examples of modifying a volume of input audiodata in conjunction with modifying a speech speed according to examplesof the present disclosure. FIG. 8A illustrates a first example ofincreasing (e.g., boosting) a volume level of a portion of the inputaudio data, such that a maximum volume level of the output audio data isgreater than a maximum volume level of the input audio data. Incontrast, FIG. 8B illustrates a second example of increasing (e.g.,repairing) volume levels within the portion of the input audio data forindividual words/sentences, such that a maximum volume level of theoutput audio data is identical to a maximum volume level of the inputaudio data but portions of the output audio data have a higher volumelevel than corresponding portions of the input audio data.

As shown in FIG. 8A, an input chart 810 illustrates input audio data 812without a speech speed modification factor being applied. In contrast,output chart 820 illustrates output audio data 822 having a first speechspeed modification factor 824 (e.g., 1×) applied to a first portion anda second speech speed modification factor 826 (e.g., 0.7×) applied to asecond portion. Thus, a first speech speed associated with the firstportion is identical to the input audio data (e.g., original speechspeed), whereas a second speech speed associated with the second portionis decreased relative to the input audio data.

In addition to changing a speech speed, the server(s) 120 may alsomodify a volume level associated with the second portion. For example,output chart 830 illustrates output audio data 832 having the firstspeech speed modification factor 824 (e.g., 1×) applied to a firstportion and the second speech speed modification factor 826 (e.g., 0.7×)applied to a second portion. In addition, the output audio data 832 hasa normal volume level 834 associated with the first portion and aboosted volume level 836 associated with the second portion. Theserver(s) 120 may generate the boosted volume level 836 using a volumemodification factor. For example, the server(s) 120 may determine thevolume modification factor based on the second speech speed modificationfactor 826 and may apply the volume modification factor to the secondportion of the input audio data. Thus, the volume is increased for thesecond portion in order to improve playback of the output audio data.

FIG. 8B illustrates the input chart 810 and the output chart 820, asdiscussed above with regard to FIG. 8A, as well as an output chart 840that illustrates output audio data 842 having the first speech speedmodification factor 824 (e.g., 1×) applied to a first portion and thesecond speech speed modification factor 826 (e.g., 0.7×) applied to asecond portion. However, in contrast to the output audio data 832 thathas a boosted volume level 836 associated with the second portion, theoutput audio data 842 has a normal volume level 844 associated with thefirst portion and a modified volume level 846 associated with the secondportion.

The server(s) 120 may generate the modified volume level 846 using a volume modification factor and a maximum threshold value (e.g., maximum volume level). For example, the server(s) 120 may determine the volume modification factor based on the second speech speed modification factor 826 and may apply the volume modification factor to the second portion of the input audio data, with any volume levels above the maximum volume level being capped at the maximum volume level. Thus, instead of increasing a maximum volume level of the second portion, the server(s) 120 increase a volume level of individual words/sentences within the output audio data 842 to be closer to the maximum volume level. This removes variations in volume levels between words/sentences, which may be caused by variations in a volume of speech, variations in distance between a speaker and the microphone(s) 112, or the like. As a result of the modified volume level 846, playback of the output audio data 842 may be improved and speech included in the output audio data 842 may be more reliably understood by the listener.
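A rough rendering of the FIG. 8B behavior is shown below: each word span receives the desired gain unless that would push its peak past the maximum volume level, in which case the gain for that span is reduced. The per-word spans, the gain value, and its tie to the speech speed factor are assumptions for illustration.

```python
from typing import List, Tuple

import numpy as np


def repair_volume(samples: np.ndarray,
                  word_spans: List[Tuple[int, int]],
                  desired_gain: float,
                  max_level: float = 1.0) -> np.ndarray:
    """Raise quiet words toward a shared ceiling without exceeding it.

    Each (start, end) sample span gets `desired_gain`, reduced per span so
    that the span's peak never passes `max_level`.  Contrast with a simple
    FIG. 8A style boost, which would just scale the whole slowed portion.
    """
    out = samples.astype(float).copy()
    for start, end in word_spans:
        peak = float(np.max(np.abs(out[start:end]))) + 1e-12
        gain = min(desired_gain, max_level / peak)
        out[start:end] *= gain
    return out
```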

While the output chart 830 illustrates the server(s) 120 increasing amaximum volume level of the output audio data 832 and the output chart840 illustrates the server(s) 120 increasing volume levels within theoutput audio data 842 to be closer to the maximum volume level, thedisclosure is not limited thereto and the server(s) 120 may increase amaximum volume level of output audio data and increase volume levelswithin output audio data without departing from the disclosure.

FIG. 9 illustrates an example of identifying speech from multiple usersand applying different speech speed modification factors based on theuser according to examples of the present disclosure. As illustrated inFIG. 9, combined speech 910 may include speech associated with threedifferent users. For example, the server(s) 120 may separate thecombined speech 910 into first speech 912 associated with a first user,second speech 914 associated with a second user, and third speech 916associated with a third user. After separating the first speech 912, thesecond speech 914, and the third speech 916, the server(s) 120 mayperform voice normalization on and/or adjust a speech speed of eachportion separately. For example, FIG. 9 illustrates modified firstspeech 922 associated with the first user, modified second speech 924associated with the second user, and modified third speech 926associated with the third user. As illustrated in FIG. 9, the server(s)120 may determine target speech speeds in order to synchronize themodified first speech 922, the modified second speech 924 and themodified third speech 926. Thus, the server(s) 120 may adjust variationsin speech speed so that a timing is uniform between the different users.This technique may be used for multiple applications, an example ofwhich is synchronizing different singers to music and/or each other.

The server(s) 120 may synchronize the modified first speech 922, themodified second speech 924 and the modified third speech 926(hereinafter, “modified speech”) using several techniques. In someexamples, the server(s) 120 may synchronize the modified speech byidentifying shared words that are common to each of the modified speech.For example, if three users are singing a song or repeating the samephrase, the server(s) 120 may detect words that are repeated by each ofthe three users and use these words to synchronize the modified speech.Additionally or alternatively, the server(s) 120 may synchronize themodified speech based on a tempo of a song. For example, if the threeusers are singing a song with a specific tempo, the modified speechshould share similar pauses or other timing and the server(s) 120 mayalign the modified speech based on a beat or other characteristicassociated with the timing. However, the disclosure is not limitedthereto and the server(s) 120 may align the modified speech using anytechnique known to one of skill in the art without departing from thedisclosure.
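The snippet below sketches only the shared-word idea in its most naive form: given per-speaker word timings from speech recognition (assumed available), it computes a time offset for each speaker so that one shared anchor word starts at the same instant for everyone. The anchor selection and averaging are illustrative choices, not the system's method.

```python
from typing import Dict, List, Tuple


def alignment_offsets(word_times: Dict[str, List[Tuple[str, float]]]) -> Dict[str, float]:
    """Per-speaker offsets that line up one word shared by every speaker.

    `word_times` maps a speaker id to (word, start_time_s) pairs, assumed to
    come from speech recognition; at least one speaker is assumed.  A positive
    offset delays that speaker, a negative offset advances them.
    """
    shared = set.intersection(*(set(w for w, _ in times) for times in word_times.values()))
    if not shared:
        return {speaker: 0.0 for speaker in word_times}

    def first_time(speaker: str, word: str) -> float:
        return min(t for w, t in word_times[speaker] if w == word)

    # Anchor on the shared word that occurs earliest on average across speakers.
    anchor = min(shared, key=lambda w: sum(first_time(s, w) for s in word_times) / len(word_times))
    mean_t = sum(first_time(s, anchor) for s in word_times) / len(word_times)
    return {s: mean_t - first_time(s, anchor) for s in word_times}
```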

In some examples, the server(s) 120 may separate the first speech 912,the second speech 914 and the third speech 916 into three separatesections and generate separated speech 930 that plays the three separatesections sequentially. For example, the combined speech 910 maycorrespond to audio data from a teleconference or the like, withmultiple users speaking at the same time. While the combined speech 910may be difficult to understand due to the overlapping speech, theseparated speech 930 only includes speech associated with a single userat a time. Thus, the separated speech 930 may be more easily understoodby the server(s) 120 and/or by a user.

FIGS. 10A-10B are block diagrams conceptually illustrating examplecomponents of a system for voice enhancement according to embodiments ofthe present disclosure. In operation, the system 100 may includecomputer-readable and computer-executable instructions that reside onthe device(s) 110/server(s) 120, as will be discussed further below.

The system 100 may include one or more audio capture device(s), such as microphone(s) 112 or an array of microphones 112. The audio capture device(s) may be integrated into the device 110 or may be separate.

The system 100 may also include an audio output device for producing sound, such as loudspeaker(s) 114. The audio output device may be integrated into the device 110 or may be separate.

As illustrated in FIGS. 10A-10B, the device(s) 110/server(s) 120 mayinclude an address/data bus 1002 for conveying data among components ofthe device(s) 110/server(s) 120. Each component within the device(s)110/server(s) 120 may also be directly connected to other components inaddition to (or instead of) being connected to other components acrossthe bus 1002.

The device(s) 110/server(s) 120 may include one or more controllers/processors 1004, each of which may include a central processing unit (CPU) for processing data and computer-readable instructions, and a memory 1006 for storing data and instructions. The memory 1006 may include volatile random access memory (RAM), non-volatile read only memory (ROM), non-volatile magnetoresistive (MRAM) and/or other types of memory. The device(s) 110/server(s) 120 may also include a data storage component 1008, for storing data and controller/processor-executable instructions (e.g., instructions to perform the algorithm illustrated in FIGS. 1, 4A and/or 4B). The data storage component 1008 may include one or more non-volatile storage types such as magnetic storage, optical storage, solid-state storage, etc. The device(s) 110/server(s) 120 may also be connected to removable or external non-volatile memory and/or storage (such as a removable memory card, memory key drive, networked storage, etc.) through the input/output device interfaces 1010.

The device(s) 110/server(s) 120 include input/output device interfaces 1010, such as the microphone(s) 112 and/or the speaker(s) 114. A variety of components may be connected through the input/output device interfaces 1010.

The input/output device interfaces 1010 may be configured to operatewith network(s) 10, for example a wireless local area network (WLAN)(such as WiFi), Bluetooth, ZigBee and/or wireless networks, such as aLong Term Evolution (LTE) network, WiMAX network, 3G network, etc. Thenetwork(s) 10 may include a local or private network or may include awide network such as the internet. Devices may be connected to thenetwork(s) 10 through either wired or wireless connections.

The input/output device interfaces 1010 may also include an interfacefor an external peripheral device connection such as universal serialbus (USB), FireWire, Thunderbolt, Ethernet port or other connectionprotocol that may connect to network(s) 10. The input/output deviceinterfaces 1010 may also include a connection to an antenna (not shown)to connect one or more network(s) 10 via an Ethernet port, a wirelesslocal area network (WLAN) (such as WiFi) radio, Bluetooth, and/orwireless network radio, such as a radio capable of communication with awireless communication network such as a Long Term Evolution (LTE)network, WiMAX network, 3G network, etc.

As discussed above with regard to FIG. 2, the server(s) 120 may includean orchestrator component 230, a speech processing component 240(including a speech recognition component 250 and a natural languagecomponent 260), user profile storage 270, a text-to-speech (TTS)component 280, one or more application(s) 290 and/or a user recognitioncomponent 295, as illustrated in FIG. 10A. In addition, the device 110may optionally include a wakeword detection component 220, asillustrated in FIG. 10B.

The device(s) 110/server(s) 120 may include a speech speed modificationmodule 1020, which may comprise processor-executable instructions storedin storage 1008 to be executed by controller(s)/processor(s) 1004 (e.g.,software, firmware, hardware, or some combination thereof). For example,components of the speech speed modification module 1020 may be part of asoftware application running in the foreground and/or background on thedevice(s) 110/server(s) 120. The speech speed modification module 1020may control the device(s) 110/server(s) 120 as discussed above, forexample with regard to FIGS. 1, 4A and/or 4B. Some or all of thecontrollers/components of the speech speed modification module 1020 maybe executable instructions that may be embedded in hardware or firmwarein addition to, or instead of, software. In one embodiment, thedevice(s) 110/server(s) 120 may operate using an Android operatingsystem (such as Android 4.3 Jelly Bean, Android 4.4 KitKat or the like),an Amazon operating system (such as FireOS or the like), or any othersuitable operating system.

Executable computer instructions for operating the device(s)110/server(s) 120 and its various components may be executed by thecontroller(s)/processor(s) 1004, using the memory 1006 as temporary“working” storage at runtime. The executable instructions may be storedin a non-transitory manner in non-volatile memory 1006, storage 1008, oran external device. Alternatively, some or all of the executableinstructions may be embedded in hardware or firmware in addition to orinstead of software.

Multiple device(s) 110/server(s) 120 may be employed in a single system 100. In such a multi-device system, each of the device(s) 110/server(s) 120 may include different components for performing different aspects of the process. The multiple device(s) 110/server(s) 120 may include overlapping components. The components of the device(s) 110/server(s) 120, as illustrated in FIGS. 10A-10B, are exemplary, and may be located in a stand-alone device or may be included, in whole or in part, as a component of a larger device or system.

The concepts disclosed herein may be applied within a number of different devices and computer systems, including, for example, general-purpose computing systems, server-client computing systems, mainframe computing systems, telephone computing systems, laptop computers, cellular phones, personal digital assistants (PDAs), tablet computers, video capturing devices, video game consoles, speech processing systems, distributed computing environments, etc. Thus the components and/or processes described above may be combined or rearranged without departing from the scope of the present disclosure. The functionality of any component described above may be allocated among multiple components, or combined with a different component. As discussed above, any or all of the components may be embodied in one or more general-purpose microprocessors, or in one or more special-purpose digital signal processors or other dedicated microprocessing hardware. One or more components may also be embodied in software implemented by a processing unit. Further, one or more of the components may be omitted from the processes entirely.

The above embodiments of the present disclosure are meant to beillustrative. They were chosen to explain the principles and applicationof the disclosure and are not intended to be exhaustive or to limit thedisclosure. Many modifications and variations of the disclosedembodiments may be apparent to those of skill in the art. Persons havingordinary skill in the field of computers and/or digital imaging shouldrecognize that components and process steps described herein may beinterchangeable with other components or steps, or combinations ofcomponents or steps, and still achieve the benefits and advantages ofthe present disclosure. Moreover, it should be apparent to one skilledin the art, that the disclosure may be practiced without some or all ofthe specific details and steps disclosed herein.

Embodiments of the disclosed system may be implemented as a computermethod or as an article of manufacture such as a memory device ornon-transitory computer readable storage medium. The computer readablestorage medium may be readable by a computer and may compriseinstructions for causing a computer or other device to perform processesdescribed in the present disclosure. The computer readable storagemedium may be implemented by a volatile computer memory, non-volatilecomputer memory, hard drive, solid-state memory, flash drive, removabledisk and/or other media.

Embodiments of the present disclosure may be performed in different forms of software, firmware and/or hardware. Further, the teachings of the disclosure may be performed by an application specific integrated circuit (ASIC), field programmable gate array (FPGA), or other component, for example.

Conditional language used herein, such as, among others, “can,” “could,”“might,” “may,” “e.g.,” and the like, unless specifically statedotherwise, or otherwise understood within the context as used, isgenerally intended to convey that certain embodiments include, whileother embodiments do not include, certain features, elements and/orsteps. Thus, such conditional language is not generally intended toimply that features, elements and/or steps are in any way required forone or more embodiments or that one or more embodiments necessarilyinclude logic for deciding, with or without author input or prompting,whether these features, elements and/or steps are included or are to beperformed in any particular embodiment. The terms “comprising,”“including,” “having,” and the like are synonymous and are usedinclusively, in an open-ended fashion, and do not exclude additionalelements, features, acts, operations, and so forth. Also, the term “or”is used in its inclusive sense (and not in its exclusive sense) so thatwhen used, for example, to connect a list of elements, the term “or”means one, some, or all of the elements in the list.

Conjunctive language such as the phrase “at least one of X, Y and Z,” unless specifically stated otherwise, is to be understood with the context as used in general to convey that an item, term, etc. may be either X, Y, or Z, or a combination thereof. Thus, such conjunctive language is not generally intended to imply that certain embodiments require at least one of X, at least one of Y, and at least one of Z to each be present.

As used in this disclosure, the term “a” or “one” may include one or more items unless specifically stated otherwise. Further, the phrase “based on” is intended to mean “based at least in part on” unless specifically stated otherwise.

What is claimed is:
 1. A computer-implemented method, comprising:receiving input audio data representing a voice command; determining aninput speech speed corresponding to the input audio data; determiningoutput data responsive to the voice command; determining first dataassociated with a user profile corresponding to the voice command, thefirst data representing an incoming communication request for the userprofile; determining a target output speed based at least in part on theinput speech speed and the first data; using the output data to generateoutput audio data representing output speech, the output speechcorresponding to the target output speed; and causing a device to outputthe output audio data.
 2. The computer-implemented method of claim 1,further comprising: determining preference data corresponding to thevoice command, the preference data representing at least one of apreviously selected target output speed, a previously used target outputspeed, or location data associated with the preference data, and whereinthe target output speed is determined further based at least in part onthe preference data.
 3. The computer-implemented method of claim 1,further comprising: determining the target output speed corresponding toa first portion of the output data; determining a second target outputspeed corresponding to a second portion of the output data; and whereina first portion of the output speech corresponds to the target outputspeed and a second portion of the output speech corresponds to thesecond target output speed.
 4. The computer-implemented method of claim3, further comprising: determining a difference between the targetoutput speed and the second target output speed; dividing the differenceby a maximum transition value to determine a number of increments; anddetermining one or more intermediate target output speeds correspondingto a third portion of the output audio data, the third portion beingbetween the first portion and the second portion, a number of the one ormore intermediate target output speeds corresponding to the number ofincrements, wherein the first portion of the output speech correspondsto the target output speed, the second portion of the output speechcorresponds to the second target output speed, and a third portion ofthe output speech corresponds to the one or more intermediate targetoutput speeds.
 5. The computer-implemented method of claim 1, furthercomprising: determining an input volume level associated with the voicecommand; determining a target volume level based at least in part on theinput volume level; and associating the target volume level with theoutput audio data.
 6. The computer-implemented method of claim 1,wherein: the voice command includes a command to play a voice message;the output data represents audio data corresponding to the voicemessage; the method further comprises determining a message speech speedassociated with the output data; and determining the target output speedcomprises determining the target output speed based at least in part onthe input speech speed and the message speech speed.
 7. Thecomputer-implemented method of claim 6, further comprising: determininga first user profile corresponding to the voice command; determiningfirst preference data associated with the first user profile, the firstpreference data indicating at least one of a previously selected targetoutput speed, a previously used target output speed, or location dataassociated with the first user profile; determining a second userprofile corresponding to the voice message; and determining secondpreference data associated with the second user profile, the secondpreference data indicating at least one of a preferred output speed forthe voice message, wherein determining the target output speed comprisesdetermining the target output speed based at least in part on one of theinput speech speed, the message speech speed, the first preference dataor the second preference data.
 8. The computer-implemented method of claim 6, wherein: the output data includes a representation of first speech associated with a first user profile and a representation of second speech associated with a second user profile, wherein the message speech speed is associated with the first speech and the target output speed is associated with the first speech, and the method further comprises: determining a second message speech speed associated with the second speech; determining a second target output speed corresponding to the second speech; and using the output data to generate the output audio data representing the output speech, a first portion of the output speech corresponding to the target output speed and a second portion of the output speech corresponding to the second target output speed.
 9. The computer-implemented method of claim 1, further comprising: determining playback speed preferences associated with the user profile; determining configuration data corresponding to information about at least one of the user profile or the voice command; determining quality data corresponding to an audio quality of the output data, and wherein determining the target output speed comprises determining the target output speed based at least in part on one of the input speech speed, the configuration data, the playback speed preferences, or the quality data.
 10. The computer-implemented method of claim 1, furthercomprising: determining a plurality of positions in the output data inwhich to insert a duration of silence, the plurality of positionsincluding a first position; and generating the output audio data usingthe output data, the output audio data including the duration of silenceat the first position.
 11. A system comprising: at least one processor;and memory including instructions operable to be executed by the atleast one processor to configure the system to: receive input audio datarepresenting a voice command; determine an input speech speedcorresponding to the input audio data; determine output data responsiveto the voice command; determine quality data corresponding to an audioquality of the output data; determine a target output speed based atleast in part on the input speech speed and the quality data; using theoutput data, generate output audio data representing output speech, theoutput speech corresponding to the target output speed; and cause adevice to output the output audio data.
 12. The system of claim 11,wherein the memory further includes instructions that, when executed,further configure the system to: determine a user profile correspondingto the voice command; determine first data corresponding to the voicecommand, the first data representing at least one of a previouslyselected target output speed, a previously used target output speed, orlocation data associated with the user profile, and wherein the targetoutput speed is determined further based at least in part on the firstdata.
 13. The system of claim 11, wherein the memory further includesinstructions that, when executed, further configure the system to:determine urgency data associated with a user profile corresponding tothe voice command, the urgency data representing at least one oflocation data associated with the user profile, calendar data associatedwith the user profile, or incoming communication data associated withthe user profile, and wherein the target output speed is determinedfurther based on at least in part the urgency data.
 14. The system ofclaim 11, wherein the memory further includes instructions that, whenexecuted, further configure the system to: determine the target outputspeed corresponding to a first portion of the output data; determine asecond target output speed corresponding to a second portion of theoutput data; wherein a first portion of the output speech corresponds tothe target output speed and a second portion of the output speechcorresponds to the second target output speed.
 15. The system of claim14, wherein the memory further includes instructions that, whenexecuted, further configure the system to: determine a differencebetween the target output speed and the second target output speed;divide the difference by a maximum transition value to determine anumber of increments; determine one or more intermediate target outputspeeds corresponding to a third portion of the output audio data, thethird portion being between the first portion and the second portion, anumber of the one or more intermediate target output speedscorresponding to the number of increments; and wherein the first portionof the output speech corresponds to the target output speed, the secondportion of the output speech corresponds to the second target outputspeed, and a third portion of the output speech corresponds to the oneor more intermediate target output speeds.
 16. The system of claim 11,wherein the memory further includes instructions that, when executed,further configure the system to: determine an input volume levelassociated with the voice command; determine a target volume level basedat least in part on the input volume level; associate the target volumelevel with the output audio data.
 17. The system of claim 11, wherein: the voice command includes a command to play a voice message, the output data represents audio data corresponding to the voice message, the memory further includes instructions that, when executed, further configure the system to determine a message speech speed associated with the audio data, and the instruction to determine the target output speed further configures the system to determine the target output speed based at least in part on the input speech speed and the message speech speed.
 18. The system of claim 11, wherein the memory further includes instructions that, when executed, further configure the system to: determine a user profile corresponding to the voice command; determine configuration data corresponding to information about at least one of the user profile or the voice command; and determine a stored output speed preference represented in the user profile, wherein the instructions that configure the system to determine the target output speed further configure the system to determine the target output speed based at least in part on one of the input speech speed, the stored output speed preference, the configuration data, or the quality data.
 19. The system of claim 17, wherein the memory further includes instructions that, when executed, further configure the system to: determine a first user profile corresponding to the voice command; determine first preference data associated with the first user profile, the first preference data indicating at least one of a previously selected target output speed, a previously used target output speed, or location data associated with the first user profile; determine a second user profile corresponding to the voice message; and determine second preference data associated with the second user profile, the second preference data indicating at least one of a preferred output speed for the voice message, and wherein the instructions that configure the system to determine the target output speed further configure the system to determine the target output speed based at least in part on one of the input speech speed, the message speech speed, the first preference data or the second preference data.